CHAPTER 4
Statistics
Topics:
4.1 Measures of Central Tendency/Location
4.2 Measures of Dispersion/Variation
4.3 Linear Correlation and Simple Linear Regression
The word Statistics has two major definitions, one plural and one singular. In the plural sense,
statistics refers to the data themselves or to numerical quantities computed from a set of data that are
systematically collected and analyzed. In the singular sense, Statistics refers to the scientific discipline
consisting of the theory and methods for processing collections of quantitative and qualitative data, useful
when making decisions in the face of uncertainty.
Below are the objectives and some key definitions to consider as you go through this module.
Objectives:
(1) Calculate the mean, median and mode of a set of data and identify the conditions under which each
is most appropriate;
(2) Calculate the range, variance, and standard deviation;
(3) Plot a scatter diagram, then measure and interpret the relationship between two variables; and
(4) Predict or estimate values of the dependent variable from known values of the independent variable.
Key Definitions:
4.1.1. Mean
The mean (often called the average) is the most popular measure of central tendency. It is
the sum of a set of observations divided by the number of observations in the set. This measure
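is appropriate for data in interval or ratio scale. The computing formulas of the mean are as
follows:
➢ Population Mean
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
➢ Sample Mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
➢ Weighted Mean
$$\bar{X} = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}$$
where $w_i$ is the weight attached to the observation $x_i$.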
Example 1. The number of hours spent by 12 students in studying their Statistics lesson
before exam were recorded as follows: 9, 11, 16, 11, 15, 12, 10, 16, 13, 11, 11, 17. Find
the arithmetic mean.
Solution: Since it was not mentioned that the data are a random sample, we assume,
for the purpose of illustration, that these are population data. Thus
$$\mu = \frac{1}{12}\sum_{i=1}^{12} x_i = \frac{1}{12}(x_1 + x_2 + \dots + x_{12})$$
$$= \frac{1}{12}(9 + 11 + 16 + 11 + 15 + 12 + 10 + 16 + 13 + 11 + 11 + 17) = \frac{152}{12} \approx 12.67$$
This result shows that, on average, the 12 students spent 12.67 hours
studying their Statistics lesson.
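The arithmetic in Example 1 can be checked with a few lines of Python. This is only an illustrative sketch; the function name `arithmetic_mean` is our own choice and does not come from the module.

```python
# Hours of study reported in Example 1.
hours = [9, 11, 16, 11, 15, 12, 10, 16, 13, 11, 11, 17]


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


mu = arithmetic_mean(hours)
print(sum(hours))    # 152
print(round(mu, 2))  # 12.67
```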
Example 3. The student’s final grades in Math 51, Math 43, GEE 12, GEC 19, PE31 and
NSTP 1 are 2.5, 2.75, 1.25, 1.75, 1.25 and 1.75, respectively. If the respective credits for
these subjects are 3, 4, 3, 3, 2, and 3 units, determine the student’s GPA or weighted
average grade.
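Solution:
$$\bar{X} = \frac{\sum_{i=1}^{6} w_i x_i}{\sum_{i=1}^{6} w_i} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + w_6 x_6}{w_1 + w_2 + w_3 + w_4 + w_5 + w_6}$$
$$= \frac{3(2.5) + 4(2.75) + 3(1.25) + 3(1.75) + 2(1.25) + 3(1.75)}{3 + 4 + 3 + 3 + 2 + 3} = \frac{35.25}{18} \approx 1.96$$
This result shows that the GPA of this student is 1.96.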
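The weighted mean of Example 3 can be sketched in Python as follows. The helper name `weighted_mean` is hypothetical; the grades and units are those given in the example.

```python
# Final grades and the corresponding credit units from Example 3.
grades = [2.5, 2.75, 1.25, 1.75, 1.25, 1.75]
units = [3, 4, 3, 3, 2, 3]


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    weighted_total = sum(w * x for w, x in zip(weights, values))
    return weighted_total / sum(weights)


gpa = weighted_mean(grades, units)
print(round(gpa, 2))  # 1.96
```

Setting every weight to 1 reduces the weighted mean to the ordinary arithmetic mean.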
4.1.2. Median
The median is the middle value of a set of observations arranged in an increasing or
decreasing order of magnitude, denoted by 𝑥̃. It is a positional value and unlike the arithmetic
mean, it is not affected by the presence of extreme values. When abnormal values or outliers
are present, it is preferable to use the median rather than the mean as a measure of central
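location. It is an appropriate measure for data which are at least in the ordinal scale.
❖ Population Median
➢ If N is odd, then the median is computed using
$$\tilde{X} = x_{\left(\frac{N+1}{2}\right)}$$
➢ If N is even, then the median is computed using
$$\tilde{X} = \frac{x_{\left(\frac{N}{2}\right)} + x_{\left(\frac{N}{2}+1\right)}}{2}$$
❖ Sample Median
➢ If n is odd, then the median is computed using
$$\tilde{x} = x_{\left(\frac{n+1}{2}\right)}$$
➢ If n is even, then the median is computed using
$$\tilde{x} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$$
Here $x_{(i)}$ denotes the i-th observation of the ordered data.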
Example 4. The ages of 8 CMU students enrolled in GEC 14 subject are: 18, 17, 23, 20,
19, 18, 21, and 22. Find the median of ages.
Solution: Arrange the ages in ascending order: 17, 18, 18, 19, 20, 21, 22, 23. This
means that $x_{(1)} = 17$, $x_{(2)} = 18$, $x_{(3)} = 18$, $x_{(4)} = 19$, $x_{(5)} = 20$,
$x_{(6)} = 21$, $x_{(7)} = 22$, $x_{(8)} = 23$.
Since it was not mentioned that the data are a random sample, we assume,
for the purpose of illustration, that these are population data. Also, since N = 8 is an
even number, the median is
$$\tilde{X} = \frac{x_{(4)} + x_{(5)}}{2} = \frac{19 + 20}{2} = 19.5$$
Example 5. The CMUCAT scores of a sample of 5 students who joined the university
during the first semester of SY 2020-2021 were found to be 78, 90, 89, 95, and 88.
Determine the median CMUCAT score.
Solution: Arrange the CMUCAT scores in ascending order: 78, 88, 89, 90, 95. This
means that $x_{(1)} = 78$, $x_{(2)} = 88$, $x_{(3)} = 89$, $x_{(4)} = 90$, $x_{(5)} = 95$.
Since n = 5, which is an odd number, the median is
$$\tilde{x} = x_{\left(\frac{n+1}{2}\right)} = x_{\left(\frac{5+1}{2}\right)} = x_{\left(\frac{6}{2}\right)} = x_{(3)} = 89.$$
Thus, the median is 89, which is the 3rd observation of the ordered data.
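The odd and even cases used in Examples 4 and 5 can be sketched in Python as below. The function name `median` is our own; note that Python lists are 0-based, so the position $(n+1)/2$ in the ordered data corresponds to index `n // 2`.

```python
def median(values):
    """Middle value of the ordered data.

    For an even number of observations, the two middle values are averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: position (n + 1) / 2 in 1-based notation.
        return ordered[mid]
    # Even count: average of positions n / 2 and n / 2 + 1 (1-based).
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [18, 17, 23, 20, 19, 18, 21, 22]  # Example 4
scores = [78, 90, 89, 95, 88]            # Example 5
print(median(ages))    # 19.5
print(median(scores))  # 89
```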
4.1.3. Mode
The mode is the value that occurs the greatest number of times, that is, the value with
the greatest frequency. It is an appropriate measure for nominal or categorical data.
Note: If all observations occur with equal frequency, then there is no modal value for the data
set.
Example 6. The CMUCAT scores of a sample of 5 students who joined the university
during the first semester of SY 2020-2021 were found to be 78, 90, 89, 95, and 88. Find
the mode CMUCAT score.
Solution: Since the observations occur with equal frequency then there is no modal value
for the data set.
Example 7. The number of hours spent by 12 students in studying their Statistics lesson
before exam were recorded as follows: 9, 11, 16, 11, 15, 12, 10, 16, 13, 11, 11, 17. Find
the mode.
Solution: The mode is 11 hours since it occurs four times while the other observations occur
only once or twice.
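A minimal Python sketch of the mode, following the convention in the note above that a data set has no mode when every value occurs equally often. The function name `modes` is hypothetical, and returning a list (so that ties for the greatest frequency are all reported) is our own design choice.

```python
from collections import Counter


def modes(values):
    """Values with the greatest frequency; an empty list means no mode."""
    counts = Counter(values)
    top = max(counts.values())
    # All distinct values equally frequent: no modal value (the module's convention).
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)


scores = [78, 90, 89, 95, 88]                           # Example 6
hours = [9, 11, 16, 11, 15, 12, 10, 16, 13, 11, 11, 17]  # Example 7
print(modes(scores))  # []
print(modes(hours))   # [11]
```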
4.2.1. Range
Range is the difference between the highest value (HV) and the lowest value (LV):
$$R = HV - LV$$
Example 8. The CMUCAT scores of a sample of 5 students who joined the university
during the first semester of SY 2020-2021 were found to be 78, 90, 89, 95, and 88. Find
the range of the CMUCAT score.
Solution: The highest CMUCAT score is 95 and the lowest CMUCAT score is 78; hence
the range is 17, that is,
𝑅 = 95 − 78 = 17.
Example 9. The number of hours spent by 12 students in studying their Statistics lesson
before exam were recorded as follows: 9, 11, 16, 11, 15, 12, 10, 16, 13, 11, 11, 17. Find
the range of the number of hours spent by 12 students in studying their Statistics lesson
before exam.
Solution: The highest value is 17 and the lowest value is 9; hence the range is 8, that is,
𝑅 = 17 − 9 = 8
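For completeness, the range in Examples 8 and 9 can be computed in Python with the built-in `max` and `min`. The function name `value_range` is our own.

```python
def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)


scores = [78, 90, 89, 95, 88]                            # Example 8
hours = [9, 11, 16, 11, 15, 12, 10, 16, 13, 11, 11, 17]  # Example 9
print(value_range(scores))  # 17
print(value_range(hours))   # 8
```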
4.2.2. Variance
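Variance is another measure of variation which can be used instead of the range. The
variance considers the deviation of each observation from the mean. The computing formulas
are defined below.
➢ Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \quad \text{or} \quad \sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - N\mu^2}{N}$$
➢ Sample Variance
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \quad \text{or} \quad s^2 = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}$$
where
$\sigma^2$ is the population variance,
$\mu$ is the population mean,
$s^2$ is the sample variance, and
$\bar{x}$ is the sample mean.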
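For the study-hours data of Example 1, treated as a population with N = 12 and $\mu \approx 12.67$,
the population variance is
$$\sigma^2 = \frac{\sum_{i=1}^{12}(x_i - \mu)^2}{12} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \dots + (x_{12} - \mu)^2}{12}$$
$$= \frac{(9 - 12.67)^2 + (11 - 12.67)^2 + (16 - 12.67)^2 + \dots + (17 - 12.67)^2}{12} \approx \frac{78.67}{12} \approx 6.56$$
The population variance ($\sigma^2$) is approximately 6.56.
For the CMUCAT scores of Example 5, a sample with n = 5 and $\bar{x} = 88$, the sample variance is
$$s^2 = \frac{\sum_{i=1}^{5}(x_i - \bar{x})^2}{5 - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + (x_4 - \bar{x})^2 + (x_5 - \bar{x})^2}{5 - 1}$$
$$= \frac{(78 - 88)^2 + (90 - 88)^2 + (89 - 88)^2 + (95 - 88)^2 + (88 - 88)^2}{5 - 1}$$
$$= \frac{(-10)^2 + (2)^2 + (1)^2 + (7)^2 + (0)^2}{4} = \frac{100 + 4 + 1 + 49 + 0}{4} = \frac{154}{4} = 38.5$$
The sample variance ($s^2$) is 38.5.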
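The two definitional formulas, dividing by N for a population and by n − 1 for a sample, can be sketched in Python as follows. The function names are our own, and the data are the study-hours and CMUCAT values used in the earlier examples.

```python
def population_variance(values):
    """Mean squared deviation from the population mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


hours = [9, 11, 16, 11, 15, 12, 10, 16, 13, 11, 11, 17]  # treated as a population
scores = [78, 90, 89, 95, 88]                            # treated as a sample
print(round(population_variance(hours), 2))  # 6.56
print(sample_variance(scores))               # 38.5
```

The only difference between the two functions is the divisor, which is why it matters whether the data are treated as a population or as a sample.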
The Sample Pearson Correlation Coefficient can be interpreted in the following manner:
1. The value of r ranges from -1 to +1. If r = +1 or r = -1, there is a perfect linear relationship and all
points lie on a straight line.
2. An r close to +1 indicates a high positive linear relationship between the two variables X and Y,
that is, as the value of X increases, the value of Y also tends to increase.
3. An r close to -1 indicates a high negative linear relationship between the two variables X and Y,
that is, as the value of X increases, the value of Y tends to decrease.
4. An r near 0 means that there is little or no linear relationship between the two variables. This does
not mean that they are not associated at all, because the relationship may be nonlinear.
A scatter diagram is a graphical presentation of the data in which the independent variable is plotted
on the horizontal axis and the dependent variable on the vertical axis. This diagram is the easiest
way to determine whether a relationship exists between the two variables.
Note: The correlation coefficient remains high in absolute value ($r \approx \pm 1$) when the points cluster
closely around a straight line (Figure 1 and Figure 2).
The Sample Coefficient of Determination, $r^2$, is the proportion of the total variation in the
values of the variable Y that can be accounted for or explained by the linear relationship with the values of the
variable X. It is usually expressed as a percentage. For example, if the correlation coefficient, r, is 0.60,
then $r^2 = (0.60)^2 = 0.36 = 36\%$. This means that 36% of the total variation of Y can be explained by its
linear relationship with X.
Estimation of Parameters
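Given the sample $\{(x_i, y_i),\ i = 1, 2, \dots, n\}$, the least squares estimates of the parameters of the
regression line $\hat{Y} = a + bX$ are:
$$b = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$
where $b$ is the regression coefficient or the slope of the regression line and $a$ is the constant of regression
or the y-intercept of the regression line. Moreover,
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \quad \text{and} \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
are the means of the sample values of $Y$ and $X$, respectively.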
Example 14. A person’s muscle mass is expected to decrease with age. To explore this relationship, a
researcher randomly selected 10 persons aged 40 to 79 years and measured their muscle mass (unit).
The results are as follows:
X (age)          71  64   43  67  56  73  68  56  76  65
Y (muscle mass)  82  91  100  68  87  73  78  80  65  84
Based on the given data, do the following:
a. Plot the scatter diagram of the given data.
b. Find the sample coefficient of determination, $r^2$, and interpret the result.
c. Obtain the regression line equation.
d. Estimate the muscle mass of a person who is 60 years old.
a. [Scatter diagram of the given data: Muscle Mass (vertical axis) against Age of a Person (horizontal axis)]
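b. From the given data, n = 10 and
$$\sum_{i=1}^{10} x_i = x_1 + x_2 + \dots + x_{10} = 71 + 64 + \dots + 65 = 639;$$
$$\sum_{i=1}^{10} y_i = y_1 + y_2 + \dots + y_{10} = 82 + 91 + \dots + 84 = 808;$$
$$\sum_{i=1}^{10} x_i^2 = 41701; \quad \sum_{i=1}^{10} y_i^2 = 66292; \quad \sum_{i=1}^{10} x_i y_i = 50887.$$
The sample Pearson correlation coefficient is
$$r = \frac{10(50887) - (639)(808)}{\sqrt{[10(41701) - 639^2][10(66292) - 808^2]}} = -0.7961449318 \approx -0.796,$$
indicating a negative linear relationship between X (age of the person) and Y (muscle mass).
The sample coefficient of determination is
$$r^2 = (-0.796)^2 \approx 0.63 = 63\%,$$
which means that 63% of the total variation of the muscle mass is explained or accounted
for by the age of the person.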
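The correlation computed for Example 14 can be reproduced with a short Python sketch. It uses the computing formula that appears in the solution (sums of x, y, x², y² and xy); the function name `pearson_r` is our own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


age = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65]
muscle_mass = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84]
r = pearson_r(age, muscle_mass)
print(round(r, 4))       # -0.7961
print(round(r ** 2, 2))  # 0.63
```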
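c. With $n = 10$, $\sum x_i = 639$, $\sum y_i = 808$, $\sum x_i^2 = 41701$ and $\sum x_i y_i = 50887$, the slope of the
regression line is
$$b = \frac{10(50887) - (639)(808)}{10(41701) - 639^2} = \frac{-7442}{8689} = -0.8564852112 \approx -0.8565.$$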
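The remaining steps of Example 14, the y-intercept and the estimate at age 60, can be sketched in Python as below. This is an illustration only: the function name `least_squares_line` is our own, the intercept is computed as the mean of Y minus the slope times the mean of X, and the values it prints come from this sketch with unrounded coefficients rather than from the module's worked solution.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + b X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)  # slope
    a = sum_y / n - b * (sum_x / n)                               # intercept
    return a, b


age = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65]
muscle_mass = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84]
a, b = least_squares_line(age, muscle_mass)
estimate_at_60 = a + b * 60
print(round(b, 4))               # -0.8565
print(round(a, 4))               # 135.5294
print(round(estimate_at_60, 2))  # 84.14
```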
1. The following are the IQ scores of a random sample of 20 Senior High School Students enrolled
at CMU:
110 100 87 101 95 107 100 100 102 90
101 98 104 105 97 96 102 99 98 103
2. Consider the data below, where X is the number of hours spent in studying and Y is the exam
score
X 3 5 4 10 9 8 7 6 5 4 12 3
Y 30 54 40 90 85 82 78 68 60 48 96 35
1. The following are the IQ scores of a random sample of 20 Senior High School
Students enrolled at CMU:
110 100 87 101 95 107 100 100 102 90
101 98 104 105 97 96 102 99 98 103
Solutions:
Arrange the data in ascending order:
87, 90, 95, 96, 97, 98, 98, 99, 100, 100, 100, 101, 101, 102, 102, 103, 104, 105, 107, 110
so that $n = 20$.
a. Mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{20}(87 + 90 + 95 + \dots + 110) = \frac{1995}{20} = 99.75.$$
Hence, the average IQ score of the 20 Senior High School Students enrolled at CMU is 99.75.
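b. Median
• Since n is even, the median is
$$\tilde{x} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} = \frac{x_{\left(\frac{20}{2}\right)} + x_{\left(\frac{20}{2}+1\right)}}{2} = \frac{x_{(10)} + x_{(11)}}{2}$$
Since $x_{(10)} = 100$ and $x_{(11)} = 100$, then
$$\tilde{x} = \frac{100 + 100}{2} = \frac{200}{2} = 100.$$
Thus, the median of the IQ scores of the 20 SHS students is 100.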
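c. Mode
The value with the greatest frequency is 100 because it occurs three times. Hence, the mode is 100.
The sample variance, using the computing formula, is
$$s^2 = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)} = \frac{20(87^2 + 90^2 + 95^2 + \dots + 110^2) - (87 + 90 + 95 + \dots + 110)^2}{20(20 - 1)}$$
$$= \frac{20(199537) - (1995)^2}{380} = \frac{10715}{380} \approx 28.20.$$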
f. Range
𝑅 = 𝐻𝑉 − 𝐿𝑉 = 110 − 87 = 23
Hence, the range is 23.
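The answers for the IQ data can be cross-checked with Python's standard `statistics` module. This is a sketch for verification only; `statistics.variance` uses the sample (n − 1) divisor, which matches the formula used above.

```python
import statistics

iq = [110, 100, 87, 101, 95, 107, 100, 100, 102, 90,
      101, 98, 104, 105, 97, 96, 102, 99, 98, 103]

print(statistics.mean(iq))                # 99.75
print(statistics.median(iq))              # 100.0
print(statistics.mode(iq))                # 100
print(round(statistics.variance(iq), 2))  # 28.2
print(max(iq) - min(iq))                  # 23
```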
2. Consider the data below, where X is the number of hours spent in studying and Y is the exam
score
X 3 5 4 10 9 8 7 6 5 4 12 3
Y 30 54 40 90 85 82 78 68 60 48 96 35
Find the following:
a. Plot the scatter diagram of the given data.
b. Find the sample coefficient of determination, 𝑟 2 and interpret the result.
c. Obtain the regression line equation.
d. Estimate the exam score when the number of hours spent in studying is 20 hours.
Solution:
a. The scatter diagram of the given data.
[Scatter diagram: Exam Score (vertical axis) against Number of Hours Spent in Studying (horizontal axis)]
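b. From the given data, n = 12 and
$$\sum_{i=1}^{12} x_i = x_1 + x_2 + \dots + x_{12} = 3 + 5 + \dots + 3 = 76;$$
$$\sum_{i=1}^{12} y_i = y_1 + y_2 + \dots + y_{12} = 30 + 54 + \dots + 35 = 766;$$
$$\sum_{i=1}^{12} x_i^2 = x_1^2 + x_2^2 + \dots + x_{12}^2 = 3^2 + 5^2 + \dots + 3^2 = 574;$$
$$\sum_{i=1}^{12} y_i^2 = 54518; \quad \sum_{i=1}^{12} x_i y_i = 5544.$$
The sample Pearson correlation coefficient is
$$r = \frac{12(5544) - (76)(766)}{\sqrt{[12(574) - 76^2][12(54518) - 766^2]}} = 0.9596877969 \approx 0.9597,$$
indicating a positive linear relationship between X (number of hours spent in studying) and Y (exam score).
The sample coefficient of determination is
$$r^2 = (0.9597)^2 \approx 0.92 = 92\%,$$
which means that 92% of the total variation of the exam score (Y) can be explained by its
linear relationship with the number of hours spent in studying (X).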
c. To solve for the estimates b and a, we have the following given values and computations:
$n = 12$;
$x_1 = 3, x_2 = 5, x_3 = 4, x_4 = 10, x_5 = 9, x_6 = 8, x_7 = 7, x_8 = 6, x_9 = 5, x_{10} = 4, x_{11} = 12, x_{12} = 3$;
$y_1 = 30, y_2 = 54, y_3 = 40, y_4 = 90, y_5 = 85, y_6 = 82, y_7 = 78, y_8 = 68, y_9 = 60, y_{10} = 48, y_{11} = 96, y_{12} = 35$;
$$\sum_{i=1}^{12} x_i = x_1 + x_2 + \dots + x_{12} = 3 + 5 + \dots + 3 = 76;$$
$$\sum_{i=1}^{12} y_i = y_1 + y_2 + \dots + y_{12} = 30 + 54 + \dots + 35 = 766;$$
$$\sum_{i=1}^{12} x_i^2 = x_1^2 + x_2^2 + \dots + x_{12}^2 = 3^2 + 5^2 + \dots + 3^2 = 574;$$
First Semester 15 CMU Mathematics Department
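Thus, with $\sum_{i=1}^{12} x_i y_i = 5544$, the slope of the regression line is
$$b = \frac{12(5544) - (76)(766)}{12(574) - 76^2} = \frac{8312}{1112} = 7.474820144 \approx 7.4748$$
and the y-intercept is
$$a = \bar{y} - b\bar{x} = \frac{766}{12} - (7.474820144)\left(\frac{76}{12}\right) \approx 16.4928,$$
so the regression line is $\hat{Y} = 16.4928 + 7.4748X$.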
d. The estimated exam score when the number of hours spent in studying is 20 hours is
$$\hat{Y} = 16.4928 + 7.4748(20) = 165.9888 \approx 166$$
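The whole of Problem 2 can be retraced in Python by first tabulating the five sums and then substituting them into the formulas, mirroring the hand computation. This is a sketch under our own variable names; because it keeps the coefficients unrounded, intermediate decimals may differ slightly from the hand-rounded values while the final rounded answers agree.

```python
from math import sqrt

hours = [3, 5, 4, 10, 9, 8, 7, 6, 5, 4, 12, 3]
exam = [30, 54, 40, 90, 85, 82, 78, 68, 60, 48, 96, 35]

# Tabulated sums used in the hand computation.
n = len(hours)
sum_x = sum(hours)                                # 76
sum_y = sum(exam)                                 # 766
sum_xx = sum(x * x for x in hours)                # 574
sum_yy = sum(y * y for y in exam)                 # 54518
sum_xy = sum(x * y for x, y in zip(hours, exam))  # 5544

# Correlation, slope and intercept from the sums.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)


def predict(x):
    """Estimated exam score for x hours of study, using Y = a + b X."""
    return a + b * x


print(round(r, 4), round(r ** 2, 2))  # 0.9597 0.92
print(round(a, 4), round(b, 4))       # 16.4928 7.4748
print(round(predict(20)))             # 166
```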
Reference: Supe, A., et al. (2013). Elementary Statistics. Central Book Supply Inc.