
Department of Civil Engineering, Bayero University Kano

EGR4201: ENGINEERING STATISTICS
Sampling and Frequency distribution

Dr B. W. Isah
Department of Civil Engineering
Bayero University Kano

• Sampling methods are the methods used to choose people, objects, or items from the population to be considered in a sample survey.
• Population: the entire collection of objects or outcomes about which information is sought. It is the complete set of all the items that interest an investigator. A population of size N can be very large, even infinite.
• Sample: a subset of a population, containing the objects or outcomes that are actually observed.

• Samples can be divided into two types:
a) Probability samples: each population element has a known probability (chance) of being chosen to form the sample.
b) Non-probability samples: one cannot be assured of a known probability of selection for each population element.

• Probability sampling methods ensure that the sample selected represents the population correctly and that the survey conducted is statistically valid. The types of probability sampling methods are:
I. Simple random sampling
II. Stratified sampling
III. Cluster sampling
IV. Multistage sampling
V. Systematic random sampling

• Simple random sampling: Let us say the population has N objects and the sample has n objects. In simple random sampling, all possible samples of n objects have an equal chance of occurring.
• One example of simple random sampling is the lottery method. Assign each population element a unique number and place the numbers in a bowl. Mix the numbers thoroughly, then draw n of them without looking.
• Simple random sampling can be used to select a sample for a quality control test in a factory.
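The lottery method can be imitated in a few lines of Python. This is only a sketch: the batch of 500 numbered items and the sample size of 10 are hypothetical values chosen for illustration, and the seed is there only so the draw can be repeated.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; every subset of size n is equally likely."""
    rng = random.Random(seed)  # the seed only makes the sketch reproducible
    return rng.sample(list(population), n)

# Hypothetical population: items numbered 1..500 from one production batch.
batch = range(1, 501)
sample = simple_random_sample(batch, n=10, seed=42)
print(sample)
```

`random.sample` draws without replacement, which matches pulling numbers out of the bowl one after another.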

• Stratified sampling: In this method, the population is divided into groups called strata based on certain common characteristics. A sample is then selected from each stratum using simple random sampling, and the survey is conducted on the people, objects, or items in those samples.
• The strata are homogeneous, and a few elements are randomly selected from each stratum.
• Example: sampling the same type of bolts and nuts from a large population produced by different machines in a factory. Each stratum can be the bolts and nuts produced by the same machine.
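A minimal sketch of the bolts-and-nuts example, assuming three hypothetical machines and an equal number of items drawn from each stratum (the slides do not say how many items to take per stratum):

```python
import random

def stratified_sample(strata, per_stratum, seed=None):
    """Draw a simple random sample of size per_stratum from every stratum."""
    rng = random.Random(seed)  # the seed only makes the sketch reproducible
    return {name: rng.sample(items, per_stratum) for name, items in strata.items()}

# Hypothetical strata: bolt serial numbers grouped by the machine that made them.
strata = {
    "machine_A": [f"A{i:03d}" for i in range(200)],
    "machine_B": [f"B{i:03d}" for i in range(150)],
    "machine_C": [f"C{i:03d}" for i in range(50)],
}
sample = stratified_sample(strata, per_stratum=5, seed=7)
for machine, bolts in sample.items():
    print(machine, bolts)
```

Every machine is guaranteed to appear in the final sample, which a single simple random sample of 15 bolts would not guarantee.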

• In cluster sampling, each population member is assigned to a unique group called a cluster. A sample cluster is selected using simple random sampling, and the survey is then conducted on the people, objects, or items of that sample cluster. Clusters are heterogeneous within each group and are chosen randomly.
• Example: A cluster can be a committee comprising members from different departments. If, from each randomly chosen cluster, a few elements are chosen randomly using simple random sampling or any other probability method, then it is two-stage cluster sampling.

• Multistage sampling: In this case, a combination of different sampling methods is used at different stages. For example, at the first stage, cluster sampling can be used to choose clusters from the population, and then simple random sampling can be used to choose elements from each cluster for the final sample.
• Example: The clusters can be committees comprising members from different departments; simple random sampling can then be used to select members from each cluster in order to form the final sample.

• Systematic random sampling: In this method, every member of the population is listed and the first sample element is randomly selected from the first k elements. Thereafter, every kth element is selected from the list.
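The rule above maps directly onto list slicing. The sketch below assumes a hypothetical list of 100 numbered items and a sampling interval of k = 10:

```python
import random

def systematic_sample(items, k, seed=None):
    """Pick a random start among the first k elements, then take every k-th element."""
    rng = random.Random(seed)  # the seed only makes the sketch reproducible
    start = rng.randrange(k)   # index 0..k-1, i.e. one of the first k elements
    return items[start::k]

# Hypothetical list: items numbered 1..100, sampling interval k = 10.
items = list(range(1, 101))
sample = systematic_sample(items, k=10, seed=1)
print(sample)
```

Only the starting point is random; once it is fixed, the rest of the sample is determined by the interval.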

• Non-probability sampling methods are convenient and cost-saving, but they do not allow us to estimate the extent to which sample statistics are likely to vary from population parameters. Probability sampling methods allow that kind of analysis.
• The types of non-probability sampling methods are:
a) Voluntary sample
b) Convenience sample

• Voluntary sample: In this method, interested people are asked to take part in a voluntary survey. A good example of a voluntary sample is an online poll for a news show or product in which viewers or users are asked to participate. In such a sample, the viewers or users choose the sample, not the person who conducts the survey.
• Convenience sample: In this method, the surveyor picks people who are easily available to give their inputs. For example, a surveyor chooses a cinema hall to survey movie viewers. If the cinema hall was selected because it was easier to reach, then it is a convenience sample.

Nominal: Measures categories. We cannot perform arithmetic on the data. Example: male and female, boy and girl, black and white, etc.
Ordinal: Measures categories that have an order. Example: your position in class.
Interval: Measures data where distances between consecutive numbers have meaning and the data are always numerical. Examples: ruler scale, temperature scale.
Ratio: Measures the relation between two numbers. Example: the ratio of female students in L400.

Variables or data fall into two categories: quantitative and qualitative.
a) Qualitative: Measured with a nominal or ordinal scale.
b) Quantitative: Measured with an interval or ratio scale.
• Quantitative data can be further divided into two:
i. Discrete: Measured with integers. Example: number of workers on a particular site.
ii. Continuous: Measured on a number line. Example: the heights of all L400 students. A value may take several decimal places.
Quantitative data can be presented in tabular form (frequency distribution table) or as a graph.

• Depending on the data type, it can be tabulated using any of the following:
a) Frequency distribution table
b) Stem and leaf
c) Cross tabulation
• Frequency distribution table: Quantifies the frequency with which each category occurs in a sample of data.
• It can be presented as tallies or as the final frequencies.

How to choose the number of classes
• Ceiling function (number of classes): Assume you have data with sample size n. Take the square root of n and round it up to the next integer.
• Class width = (max obs − min obs)/number of classes
Example 2.0. A record of the time taken to prepare the cement orders of 75 customers is given below.
• Number of classes = $\sqrt{75} = 8.66 \approx 9$
• Class width = $\dfrac{20.3 - 4.1}{9} = 1.8$
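The two rules can be checked with a short sketch. It uses only the figures quoted in Example 2.0 (n = 75, minimum 4.1, maximum 20.3); rounding the square root upward follows the "ceiling" wording of the slide, and the function name is just an illustrative choice.

```python
import math

def class_layout(n, min_obs, max_obs):
    """Number of classes = ceil(sqrt(n)); class width = range / number of classes."""
    num_classes = math.ceil(math.sqrt(n))
    width = (max_obs - min_obs) / num_classes
    return num_classes, width

num_classes, width = class_layout(75, 4.1, 20.3)
# Class boundaries start at the minimum and advance by one width at a time.
boundaries = [round(4.1 + i * width, 1) for i in range(num_classes + 1)]
print(num_classes, round(width, 1))
print(boundaries)
```

The boundaries produced this way are the class limits used in the frequency table for this example.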

Class        Freq.   Cum. freq.
4.1-5.9      11      11
5.9-7.7      9       20
7.7-9.5      12      32
9.5-11.3     8       40
11.3-13.1    10      50
13.1-14.9    8       58
14.9-16.7    10      68
16.7-18.5    4       72
18.5-20.3    3       75

Stem-and-leaf displays are simple displays that are particularly suitable for exploratory analysis of fairly small sets of data. The basic ideas will be developed with an example.
• Ideal for displaying a trend
• Ideal for small data sets
Example 2.2: The scores of L400 students in Statistics are presented below. Use a stem-and-leaf display to represent the data.
80 71 52 65 69 96 87 93 79 71
61 72 95 76 50 79 92 81 86 68
83 92 77 64 98 57 85 71 72 87

Stem   Leaf        Freq.
5      207         3
6      59184       5
7      191269712   9
8      0716357     7
9      635228      6
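As a cross-check, the display for Example 2.2 can be built programmatically. This is a minimal sketch that assumes two-digit integer scores, uses the tens digit as the stem, and keeps the leaves in the order the scores are listed:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit integers by tens digit; leaves are the units digits in data order."""
    stems = defaultdict(list)
    for v in values:
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

scores = [80, 71, 52, 65, 69, 96, 87, 93, 79, 71,
          61, 72, 95, 76, 50, 79, 92, 81, 86, 68,
          83, 92, 77, 64, 98, 57, 85, 71, 72, 87]

table = stem_and_leaf(scores)
for stem, leaves in table.items():
    print(stem, "".join(map(str, leaves)), len(leaves))
```

The frequencies must add up to the number of scores (30), which is a quick way to catch a dropped leaf.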

Example 4.1: Data have been obtained on the lives of batteries of a particular type in an industrial application. Table 1.1 shows the lives of 36 batteries recorded to the nearest tenth of a year.
Table 1.1: Battery Lives, years
4.1 5.2 2.8 4.9 5.6 4.0 4.1 4.3 5.4 4.5 6.1 3.7 2.3 4.5 4.9 5.6 4.3 3.9
3.2 5.0 4.8 3.7 4.6 5.5 1.8 5.1 4.2 6.3 3.3 5.8 4.4 4.8 3.0 4.3 4.7 5.1

Stem   Leaf               Frequency
1      8                  1
2      38                 2
3      023779             6
4      0112333455678899   16
5      011245668          9
6      13                 2

Cross tabulation allows us to see the frequencies of the observations that fall at the intersection of two variables. It is especially good for qualitative variables or discretized quantities.
Example: Lighting condition of a particular floor of a company.
• Variable 1: satisfaction level, rated from 1 (not satisfied) to 5 (very satisfied)
Not sat.                      Very sat.
1       2       3       4       5

• Variable 2: age of the operators
Under 30     30-50     Over 50

Operator   Satisfaction level   Age
1          4                    53
2          3                    37
3          1                    24
.          4                    28
.          .                    .
77         3                    51

Sat. level   Under 30   30-50   Over 50   Total
1            7          3       0         10
2            19         14      3         36
3            28         17      12        57
4            11         22      16        49
5            2          9       14        25
Total        67         65      45        177
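The tallying behind a cross tabulation can be sketched as follows. Only the four operator rows whose values are visible in the listing above are used, so the counts are not those of the full table; placing ages of exactly 30 and 50 in the middle group is an assumption.

```python
from collections import Counter

def age_group(age):
    """Map an age in years to the three groups of the table (boundary handling assumed)."""
    if age < 30:
        return "Under 30"
    if age <= 50:
        return "30-50"
    return "Over 50"

# (satisfaction level, age) for the four operators whose rows are visible above.
records = [(4, 53), (3, 37), (1, 24), (4, 28)]

counts = Counter((sat, age_group(age)) for sat, age in records)

groups = ["Under 30", "30-50", "Over 50"]
print("Sat. level", *groups, "Total")
for sat in range(1, 6):
    row = [counts[(sat, g)] for g in groups]
    print(sat, *row, sum(row))
```

Each observation lands in exactly one cell, so the cell counts always add up to the number of records.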

As an example, the data in Table 1.3 concern the geyser Old Faithful in Yellowstone National Park. This geyser alternates periods of eruption, which typically last from 1.5 to 4 minutes, with periods of dormancy, which are considerably longer. Table 1.3 presents the durations, in minutes, of 60 dormant periods. The list has been sorted into numerical order.
TABLE 1.3 Durations (in minutes) of dormant periods of the geyser Old Faithful
42 45 49 50 51 51 51 51 53 53
55 55 56 56 57 58 60 66 67 67
68 69 70 71 72 73 73 74 75 75
75 75 76 76 76 76 76 79 79 80
80 80 80 81 82 82 82 83 83 84
84 84 85 86 86 86 88 90 91 93

The number of classes in a histogram may be taken as $\log_2 n$ or $2n^{1/3}$.

A histogram tells us:
• How are the data distributed?
• Does any class stand out?
• Is the distribution symmetrical?
• Is the distribution skewed in one direction?
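To see what the two class-count rules give for the Old Faithful data in Table 1.3, the sketch below computes both and tallies the durations into equal-width classes. Rounding the rules up to whole numbers and spreading the classes evenly between the smallest and largest observation are assumptions; the slides do not fix either choice.

```python
import math

durations = [42, 45, 49, 50, 51, 51, 51, 51, 53, 53,
             55, 55, 56, 56, 57, 58, 60, 66, 67, 67,
             68, 69, 70, 71, 72, 73, 73, 74, 75, 75,
             75, 75, 76, 76, 76, 76, 76, 79, 79, 80,
             80, 80, 80, 81, 82, 82, 82, 83, 83, 84,
             84, 84, 85, 86, 86, 86, 88, 90, 91, 93]

n = len(durations)
classes_log2 = math.ceil(math.log2(n))          # rule: log2(n), rounded up
classes_cuberoot = math.ceil(2 * n ** (1 / 3))  # rule: 2 * n^(1/3), rounded up

def class_counts(values, k):
    """Count observations in k equal-width classes between the min and the max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # the maximum goes in the last class
        counts[idx] += 1
    return counts

counts = class_counts(durations, classes_log2)
print(classes_log2, classes_cuberoot)
for i, c in enumerate(counts):
    print(f"class {i + 1}: " + "*" * c)
```

The row of asterisks is a rough text histogram; the questions in the list above (shape, symmetry, skew) can be read from it.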
Scatter plot: Represents 2 numerical variables

Cumulative frequency plot: tells us how many observations fall below a particular value on the horizontal axis.

Scatter plot example: operators are paid partially according to a polish piece-rate system.

Example: 45 50 65 77 95 83 74 65 80 58 68 90 88 68 55 79 83 72 74 76
Construct the frequency table using 5 classes.

Class boundary   f   Class midpoint   Cum. f
44.5-55.5        4   50               4
55.5-66.5        5   61               9
66.5-77.5        6   72               15
77.5-88.5        7   83               22
88.5-99.5        3   94               25

A dotplot is a graph that can be used to give a rough impression of the shape of a sample. It is useful when the sample size is not too large and when the sample contains some repeated values.
Example: The weather in Kano is dry most of the time, but it can be quite rainy in the rainy season. The rainiest month of the year is August. The following table presents the rainfall in Kano, in inches, for each August from 1982 to 2023.
0.2 3.7 1.2 13.7 1.5 0.2 1.7 0.6 0.1 8.9 1.9 5.5 0.5 3.1 3.1 8.9 8.0
12.7 4.1 0.3 2.6 1.5 8.0 4.6 0.7 0.7 6.6 4.9 0.1 4.4 3.2 11.0 7.9
0.0 1.3 2.4 0.1 2.8 4.9 3.5 6.1 0.1
a. Construct a stem-and-leaf plot for these data.
b. Construct a histogram for these data.
c. Construct a dotplot for these data

Forty-five specimens of a certain type of powder were analyzed for sulfur trioxide content. Following are the results, in percent. The list has been sorted into numerical order.
14.1 14.2 14.3 14.3 14.3 14.4 14.4 14.4 14.4
14.6 14.7 14.7 14.8 14.8 14.8 14.8 14.9 15.0
15.0 15.2 15.3 15.3 15.4 15.4 15.5 15.6 15.7
15.7 15.9 15.9 16.1 16.2 16.4 16.4 16.5 16.6
17.2 17.2 17.2 17.2 17.3 17.3 17.8 21.9 22.4
a. Construct a stem-and-leaf plot for these data.
b. Construct a histogram for these data.
c. Construct a dotplot for these data.

• The sample mean is also called the “arithmetic mean,” or, more simply, the “average.” It is the sum of the numbers in the sample, divided by how many there are.
• The sample mean is
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
• A sample of five men is chosen at random from a large population of men, and their heights are measured as follows: 65.51, 72.30, 68.31, 67.05, and 70.68 in. Find the mean.
$$\bar{X} = \frac{1}{5}(65.51 + 72.30 + 68.31 + 67.05 + 70.68) = 68.77 \text{ in}$$
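The same calculation in Python, as a quick check of the arithmetic (the variable names are illustrative):

```python
# Heights of the five men, in inches, as used in the worked calculation.
heights = [65.51, 72.30, 68.31, 67.05, 70.68]

# Sample mean: the sum of the observations divided by how many there are.
mean = sum(heights) / len(heights)
print(round(mean, 2))
```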

• The standard deviation is a quantity that measures the degree of spread in a sample. When the spread is large, the sample values will tend to be far from their mean, but when the spread is small, the values will tend to be close to their mean.
• EX: Consider the following two lists of numbers: 28, 29, 30, 31, 32 and 10, 20, 30, 40, 50. Both have the same mean of 30, but the second list is much more spread out.
• Let $X_1, \ldots, X_n$ be a sample. The sample variance is the quantity
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

• An equivalent formula for the sample variance is
$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$$
• The sample standard deviation is the square root of the sample variance:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2} = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)}$$

• Compute the standard deviation of the heights data above. The sample mean is $\bar{X} = 68.77$.
• The sample variance is therefore
$$s^2 = \frac{1}{4}\left[(65.51 - 68.77)^2 + (72.30 - 68.77)^2 + (68.31 - 68.77)^2 + (67.05 - 68.77)^2 + (70.68 - 68.77)^2\right] = 7.47665$$
• Alternatively:
$$s^2 = \frac{1}{4}\left[65.51^2 + 72.30^2 + 68.31^2 + 67.05^2 + 70.68^2 - 5(68.77^2)\right] = 7.47665$$
• The sample standard deviation is $s = \sqrt{7.47665} = 2.73$.
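Both forms of the variance formula can be checked numerically. The sketch below computes them by hand and compares them with Python's `statistics` module, whose `variance` and `stdev` functions also divide by n − 1:

```python
import statistics

heights = [65.51, 72.30, 68.31, 67.05, 70.68]
n = len(heights)
mean = sum(heights) / n

# Definition: squared deviations from the mean, divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in heights) / (n - 1)

# Alternative form: (sum of squares - n * mean^2) / (n - 1).
var_shortcut = (sum(x ** 2 for x in heights) - n * mean ** 2) / (n - 1)

s = var_definition ** 0.5
print(round(var_definition, 5), round(var_shortcut, 5), round(s, 2))
print(round(statistics.variance(heights), 5), round(statistics.stdev(heights), 2))
```

The shortcut form subtracts two large, nearly equal numbers, so it can lose precision on a computer; the definition is the safer one to code.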

• What if the heights were measured in centimeters rather than inches? Let us denote the heights in inches by $X_1, \ldots, X_5$ and the heights in centimeters by $Y_1, \ldots, Y_5$. The relationship between $X_i$ and $Y_i$ is then given by $Y_i = 2.54X_i$.
• The sample means are related in the same way: $\bar{Y} = 2.54\bar{X}$
• Deviations: $(Y_i - \bar{Y}) = 2.54(X_i - \bar{X})$
• Therefore, $s_Y^2 = 2.54^2 s_X^2$ and $s_Y = 2.54 s_X$
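A quick numerical check of the scaling rule, using the five heights (in inches) from the mean and variance examples:

```python
import statistics

heights_in = [65.51, 72.30, 68.31, 67.05, 70.68]
heights_cm = [2.54 * x for x in heights_in]  # Y_i = 2.54 * X_i

s_in = statistics.stdev(heights_in)
s_cm = statistics.stdev(heights_cm)

# The standard deviation scales by 2.54 and the variance by 2.54 squared.
print(round(s_cm, 2), round(2.54 * s_in, 2))
print(round(s_cm ** 2, 2), round(2.54 ** 2 * s_in ** 2, 2))
```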

Outliers are points in a data set that are much larger or smaller than the rest. Sometimes outliers result from data entry errors, e.g. a misplaced decimal point in a data value.

• The median, like the mean, is a measure of center. To compute the median of a sample, order the values from smallest to largest. The sample median is the middle number.
• If n is odd, the sample median is the number in position $\frac{n+1}{2}$.
• If n is even, the sample median is the average of the numbers in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.
EX: The five heights above, arranged in increasing order, are 65.51, 67.05, 68.31, 70.68, 72.30. The sample median is 68.31.

• The median is often used as a measure of center for samples that contain outliers. To see why, consider the sample consisting of the values 1, 2, 3, 4, and 20. The mean is 6, and the median is 3. It is reasonable to think that the median is more representative of the sample than the mean is.
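The effect of the outlier is easy to see numerically. In this sketch the second list, with 5 in place of 20, is a hypothetical comparison added for illustration:

```python
import statistics

with_outlier = [1, 2, 3, 4, 20]
without_outlier = [1, 2, 3, 4, 5]  # hypothetical: the same sample with 20 replaced by 5

for sample in (with_outlier, without_outlier):
    print(sample, statistics.mean(sample), statistics.median(sample))
```

Changing the single extreme value moves the mean from 6 to 3, while the median stays at 3 in both cases.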

• Like the median, the trimmed mean is a measure of center that is designed to be unaffected by outliers.
• The trimmed mean is computed by arranging the sample values in order, “trimming” an equal number of them from each end, and computing the mean of those remaining. If p% of the data are trimmed from each end, the resulting trimmed mean is called the “p% trimmed mean.”
• The median is an extreme form of trimmed mean.
• The number of data points to trim from each end is given by np/100, where n is the sample size.
• The number of data points trimmed must be a whole number, so np/100 is rounded if necessary.

EX: In the article “Evaluation of Low-Temperature Properties of HMA Mixtures” (P. Sebaaly, A. Lake, and J. Epps, Journal of Transportation Engineering, 2002: 578–583), the following values of fracture stress (in megapascals) were measured for a sample of 24 mixtures of hot-mixed asphalt (HMA).
30 75 79 80 80 105 126 138 149 179 179 191 223
232 232 236 240 242 245 247 254 274 384 470
Compute the mean, median, and the 5%, 10%, and 20% trimmed means.

• The mean of all 24 numbers is 195.42.
• The median is the average of the 12th and 13th numbers, which is (191 + 223)/2 = 207.00.
• To compute the 5% trimmed mean, we must drop 5% of the data from each end.
• This comes to (0.05)(24) = 1.2 observations. We round 1.2 to 1, and trim one observation off each end. The 5% trimmed mean is the average of the remaining 22 numbers:
$$\frac{75 + 79 + \cdots + 274 + 384}{22} = 190.45$$

• To compute the 10% trimmed mean, round off (0.1)(24) = 2.4 to 2.
• Drop 2 observations from each end, and then average the remaining 20:
$$\frac{79 + 80 + \cdots + 254 + 274}{20} = 186.55$$
• For the 20% trimmed mean, round off (0.2)(24) = 4.8 to 5. Drop 5 observations from each end, and then average the remaining 14:
$$\frac{105 + 126 + \cdots + 242 + 245}{14} = 194.07$$
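The trimming procedure can be written as a small function. This is a sketch of the steps in the worked example; the function name is illustrative, and rounding np/100 to the nearest whole number (halves rounded up) is an assumption about how ties should be treated.

```python
def trimmed_mean(values, p):
    """p% trimmed mean: drop round(n*p/100) points from each end, average the rest."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * p / 100 + 0.5)  # nearest whole number of points to trim per end
    if 2 * k >= n:
        raise ValueError("trimming would remove every observation")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

# Fracture stress (MPa) of the 24 HMA mixtures.
stress = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191, 223,
          232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

for p in (5, 10, 20):
    print(f"{p}% trimmed mean: {trimmed_mean(stress, p):.2f}")
```

With p = 0 the function returns the ordinary mean, so the untrimmed value of 195.42 can be checked with the same code.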

• The sample mode is the most frequently occurring value in a sample.
EX: Compute the mode of the following data:
30 75 79 80 80 105 126 138 149 179 179 191 223
232 232 236 240 242 245 247 254 274 384 470
There are three modes: 80, 179, and 232. Each of
these values appears twice, and no other value
appears more than once.
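A sample with several modes can be handled with `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest frequency:

```python
import statistics

stress = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191, 223,
          232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

modes = statistics.multimode(stress)
print(modes)
```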

• The median divides the sample in half. Quartiles divide it as nearly as possible into quarters.
• A sample has three quartiles.
• The simplest method of computing the quartiles is as follows:
a) Order the sample values from smallest to largest.
b) For the first quartile, compute 0.25(n + 1). If this is an integer, then the sample value in that position is the first quartile. If not, then take the average of the sample values on either side of this position.
c) For the third quartile, use 0.75(n + 1) in the same way.
d) For the second quartile, use 0.5(n + 1).
e) The second quartile is identical to the median.

• Find the first and third quartiles of the asphalt data above.
Solution
• The sample size is n = 24.
• To find the first quartile, compute (0.25)(25) = 6.25. The first quartile is therefore found by averaging the 6th and 7th data points, when the sample is arranged in increasing order. This yields (105 + 126)/2 = 115.5.
• To find the third quartile, compute (0.75)(25) = 18.75. We average the 18th and 19th data points to obtain (242 + 245)/2 = 243.5.
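The position rule used in this solution can be sketched as a function. Note that this is the simple (n + 1) method described here, not the interpolation that statistical libraries use by default, so library results may differ slightly; the function name is illustrative.

```python
import math

def position_rule(values, fraction):
    """Value at position fraction*(n+1) of the sorted sample (positions start at 1).
    If the position is not a whole number, average the two neighbouring values."""
    ordered = sorted(values)
    n = len(ordered)
    pos = fraction * (n + 1)
    if pos < 1 or pos > n:
        raise ValueError("position falls outside the sample")
    nearest = round(pos)
    if math.isclose(pos, nearest):
        return ordered[nearest - 1]
    lower = math.floor(pos)
    return (ordered[lower - 1] + ordered[lower]) / 2

stress = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191, 223,
          232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

q1 = position_rule(stress, 0.25)
q2 = position_rule(stress, 0.50)
q3 = position_rule(stress, 0.75)
print(q1, q2, q3)
```

The second quartile computed this way equals the median of the sample, as expected.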

• The pth percentile of a sample, for a number p between 0 and 100, divides the sample so that as nearly as possible p% of the sample values are less than the pth percentile, and (100 − p)% are greater.
a) Order the sample values from smallest to largest.
b) Then compute the quantity (p/100)(n + 1), where n is the sample size.
c) If this quantity is an integer, the sample value in this position is the pth percentile. Otherwise, average the two sample values on either side.
d) The first quartile is the 25th percentile.
e) The median is the 50th percentile.
f) The third quartile is the 75th percentile.

• Percentiles are often used to interpret scores on standardized tests.
• For example, if a student is informed that her score on a college entrance exam is on the 64th percentile, this means that 64% of the students who took the exam got lower scores.
• Find the 65th percentile of the asphalt data above.
Solution
• The sample size is n = 24. To find the 65th percentile, compute (0.65)(25) = 16.25. The 65th percentile is therefore found by averaging the 16th and 17th data points, when the sample is arranged in increasing order. This yields (236 + 240)/2 = 238.

• A boxplot is a graphic that presents the median, the first and third quartiles, and any outliers that are present in a sample.

• Boxplots are particularly suitable for comparing sets of data. They can also be used to identify and delete outliers (errors) in a data set.

• The interquartile range (IQR) is the difference between the third quartile and the first quartile.
• Note that since 75% of the data are less than the third quartile, and 25% of the data are less than the first quartile, it follows that 50%, or half, of the data are between the first and third quartiles.
• The interquartile range is therefore the distance needed to span the middle half of the data.
• We have defined outliers as points that are unusually large or small.

• If IQR represents the interquartile range, then for the purpose of drawing boxplots, any point that is more than 1.5 IQR above the third quartile, or more than 1.5 IQR below the first quartile, is considered an outlier. Some texts define a point that is more than 3 IQR from the first or third quartile as an extreme outlier.

■ Compute the median and the first and third quartiles of the sample. Indicate these with horizontal lines. Draw vertical lines to complete the box.
■ Find the largest sample value that is no more than 1.5 IQR above the third quartile, and the smallest sample value that is no more than 1.5 IQR below the first quartile. Extend vertical lines (whiskers) from the quartile lines to these points.
■ Points more than 1.5 IQR above the third quartile, or more than 1.5 IQR below the first quartile, are designated as outliers. Plot each outlier individually.
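The numbers needed to draw a boxplot by these steps can be computed as below. The sketch uses the asphalt fracture-stress data from the earlier examples and the simple (n + 1) position rule for the quartiles; the helper and variable names are illustrative.

```python
import math

def position_rule(values, fraction):
    """Quartile by the (n + 1) position rule; average the neighbours if needed."""
    ordered = sorted(values)
    pos = fraction * (len(ordered) + 1)
    if pos < 1 or pos > len(ordered):
        raise ValueError("position falls outside the sample")
    nearest = round(pos)
    if math.isclose(pos, nearest):
        return ordered[nearest - 1]
    lower = math.floor(pos)
    return (ordered[lower - 1] + ordered[lower]) / 2

stress = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191, 223,
          232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

q1, median, q3 = (position_rule(stress, f) for f in (0.25, 0.50, 0.75))
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # points below this are outliers
upper_fence = q3 + 1.5 * iqr   # points above this are outliers

inside = [x for x in stress if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [x for x in stress if x < lower_fence or x > upper_fence]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

For this sample only the largest value, 470, lies beyond the upper fence, so it is plotted individually and the upper whisker stops at 384.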
