
Chapter 4: Data Management

Module Overview
In today's world, we process enormous amounts of data almost every day. In schools,
laboratories, and companies, large volumes of data are processed. Data management plays an
important role in processing these data. To help analyze a certain phenomenon, we need to
manage data with the help of statistics. When data are managed efficiently, the result is a
better understanding of the nature of that phenomenon, which in turn can improve lives in the
modern world. Obtaining and presenting data is a very challenging job that demands careful
handling. A person who collects data must be precise, accurate, open-minded, and creative in
presenting the data obtained.

Objectives:

Upon completion of this module, the students should be able to:


1. Construct a frequency distribution table using Rule 1 and Rule 2.
2. Determine the appropriate graph to be used in the given situations.
3. Use a variety of statistical tools to process and manage numerical data.
4. Advocate the use of statistical data in making important decisions.
5. Solve the different measures of Central Tendency.
6. Analyze and interpret the data presented in the table using measures of central tendency
7. Compute the measures of dispersion.
8. Solve for the measures of relative position.
9. Use linear regression to predict the value of a variable given certain conditions.
10. Apply correlation to determine the relationship between two variables.

Learning Module in GE 3(Mathematics in the Modern World): A Simplified Approach Page


1
Lesson 1 Frequency Distribution
Learning Outcomes:
At the end of this lesson, you must have:
a) understood the different terms related to the organization of data;
b) constructed a frequency distribution table using Rule 1 and Rule 2; and
c) graphed statistical data efficiently and correctly.

CATCH IT
How do we make a frequency distribution table?
What is the purpose of frequency distribution table?
How do we find the midpoint in the frequency distribution table?

CONCEPTUALIZE
A. Organization of Data
When conducting statistical research, an investigation, or a study, the researcher must
gather data for the particular variable under investigation. To describe situations, make
conclusions, and draw inferences about events, the researcher must organize the data gathered
in some meaningful way. The easiest and most widely used way of organizing data is to construct a
frequency distribution. A frequency distribution is a grouping of the data into categories
showing the number of observations in each of the non-overlapping classes.
After organizing the data, the researcher's next move is to present the data so they can
be understood easily by those who will benefit from reading the study. The most useful method
of presenting data is by constructing graphs and charts. There are a number of ways to plot
graphs and charts, and each one has a specific purpose.
This section discusses how to organize data by constructing a frequency distribution and
how to present data by constructing graphs and charts. Before constructing a frequency
distribution, we must define some terms that are essential to a deeper understanding of the
nature of the data displayed in a frequency distribution.

• Raw data are the data collected in original form.
• Range is the difference between the highest value and the lowest value in the
distribution.
• Frequency distribution is the organization of data in tabular form, using
mutually exclusive classes and showing the number of observations in each.
• Class limits (or apparent limits) are the highest and lowest values describing a
class.
• Class boundaries are the upper and lower values of a class for a grouped frequency
distribution; they have one more decimal place than the class limits
and end with the digit 5.
• Interval (or width) is the distance between the lower class boundary and the
upper class boundary, denoted by the symbol i.
• Frequency (f) is the number of values in a specific class of a frequency
distribution.
• Cumulative frequency (cf) is the sum of the frequencies accumulated up to the
upper boundary of a class in a frequency distribution.

• Midpoint is the point halfway between the class limits of each class and is
representative of the data within that class.
• A grouped frequency distribution is used when the range of the data set is
large; the data must be grouped into classes, whether they are categorical data or
interval data. For interval data, the class is more than one unit in width. The
procedure for constructing the frequency distribution is illustrated in the examples that follow.

• Categorical Frequency Distribution

The categorical frequency distribution is used to organize nominal-level or
ordinal-level data. Some examples where we can apply this
distribution are gender, business type, political affiliation, and others.

Example 1: Twenty applicants were given a performance evaluation appraisal.
The data set is

High      High      High      Low       Average
Average   Low       Average   Average   Average
Low       Average   Average   High      High
Low       Low       Average   High      High

Construct a frequency distribution for the data.


Solution:
Step 1: Construct a table as shown below.

Class     Tally      Frequency   Percentage
High
Average
Low

Step 2: Tally the raw data.

Class     Tally      Frequency   Percentage
High      IIII-II
Average   IIII-III
Low       IIII

Step 3: Convert the tallied data into numerical frequencies.

Class     Tally      Frequency   Percentage
High      IIII-II    7
Average   IIII-III   8
Low       IIII       5

Step 4: Determine the percentage. The percentage is computed using the formula:

$$\text{Percentage} = \frac{f}{n} \times 100\%$$

where $f$ = frequency of the class and $n$ = total number of values.

Class     Tally      Frequency   Percentage   Found by
High      IIII-II    7           35           (7 ÷ 20) × 100
Average   IIII-III   8           40           (8 ÷ 20) × 100
Low       IIII       5           25           (5 ÷ 20) × 100
Total                20          100

For the sample, more applicants received an average performance rating.
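The four steps of Example 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the module's procedure: the function name `categorical_distribution` is hypothetical, and the tallying is done by the standard library's `Counter`.

```python
from collections import Counter

# The 20 performance ratings from Example 1, listed row by row.
ratings = (
    "High High High Low Average "
    "Average Low Average Average Average "
    "Low Average Average High High "
    "Low Low Average High High"
).split()


def categorical_distribution(values):
    """Return {class: (frequency, percentage)} for categorical data."""
    counts = Counter(values)      # Steps 2 and 3: tally and count
    n = len(values)               # total number of values
    # Step 4: percentage = (f / n) x 100
    return {label: (f, f * 100 / n) for label, f in counts.items()}


table = categorical_distribution(ratings)
for label in ("High", "Average", "Low"):
    f, pct = table[label]
    print(f"{label:<8} {f:>3} {pct:>6.1f}%")
```

The printed rows should agree with the Frequency and Percentage columns of the table above.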

• Determining Class Interval

Generally, the number of classes for a frequency distribution table varies from 5
to 20, depending primarily on the number of observations in the data set. It is preferred to have
more classes as the size of the data set increases. The decision about the number of classes
depends on the method used by the researcher.

1. Rule 1. Determine the number of classes by using the smallest positive integer $k$
such that $2^k \ge n$, where $n$ is the total number of observations. Then

$$i = \text{Suggested Class Interval} = \frac{\text{Range}}{\text{Number of Classes}} = \frac{HV - LV}{k}$$

Where: $HV$ = highest value in a data set
$LV$ = lowest value in a data set
$k$ = number of classes
$i$ = suggested class interval

2. Rule 2. Another way to determine the class interval is by applying the formula:

$$\text{Suggested Class Interval} = \frac{\text{Range}}{1 + 3.322 \log n}$$

where $n$ is the total number of frequencies.
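As a rough sketch, the two rules can be written as small Python functions. The function names are hypothetical. Two assumptions are made for Rule 2: the logarithm is taken to base 10, and the constant is passed as a parameter whose default, 3.322, is the value used in Sturges' formula.

```python
import math


def classes_by_rule_1(n):
    """Rule 1: smallest positive integer k such that 2**k >= n."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k


def interval_rule_1(highest, lowest, n):
    """Suggested class interval = range / k, with k from Rule 1."""
    return (highest - lowest) / classes_by_rule_1(n)


def interval_rule_2(highest, lowest, n, constant=3.322):
    """Suggested class interval = range / (1 + constant * log10(n)).

    Base-10 logarithm and the default constant are assumptions.
    """
    return (highest - lowest) / (1 + constant * math.log10(n))


# Illustration with 80 observations ranging from 14,000 to 33,500.
print(classes_by_rule_1(80))
print(interval_rule_1(33_500, 14_000, 80))
print(interval_rule_2(33_500, 14_000, 80))
```

In practice either result is rounded up to a convenient width before the classes are set.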

• Grouped Frequency Distribution

Example 2:
Suppose a researcher wishes to study the monthly salary of young professionals
at selected companies in Makati City. The researcher would first have to collect the data by
asking each young professional about his or her monthly salary. Data collected in original form
are called raw data. In this case the data are

17,400 32,400 20,200 21,300 26,200 22,750 24,600 27,300 23,500 29,500
14,000 30,500 17,950 20,250 24,750 21,750 23,700 26,500 22,900 27,500
15,500 30,700 18,400 20,400 25,000 21,900 23,850 26,800 23,000 27,800
17,300 32,100 20,000 21,000 26,100 22,600 24,500 27,000 23,400 29,300
15,700 30,700 18,700 20,500 25,150 21,900 24,100 26,900 23,200 27,900
14,300 30,650 18,350 20,300 25,000 21,800 23,700 26,500 22,900 27,600
17,000 30,750 18,800 20,800 26,000 22,000 24,300 27,000 23,400 27,900
17,800 33,500 20,250 21,600 26,300 22,800 24,700 27,400 23,700 30,400

Construct a frequency distribution using the $2^k$ rule and determine the following:
a. Range
b. Interval
c. Class limits
d. Relative frequencies
e. Percentages
f. Cumulative frequencies
g. Midpoints

Solution:
Step 1: Arrange the raw data in ascending or descending order. In this particular example
we will arrange raw data in ascending order. This will make it easier for us to tally the data.

14,000 17,950 20,250 21,750 22,900 23,700 24,750 26,500 27,500 30,500
14,300 18,350 20,300 21,800 22,900 23,700 25,000 26,500 27,600 30,650
15,500 18,400 20,400 21,900 23,000 23,850 25,000 26,800 27,800 30,700
15,700 18,700 20,500 21,900 23,200 24,100 25,150 26,900 27,900 30,700
17,000 18,800 20,800 22,000 23,400 24,300 26,000 27,000 27,900 30,750
17,300 20,000 21,000 22,600 23,400 24,500 26,100 27,000 29,300 32,100
17,400 20,200 21,300 22,750 23,500 24,600 26,200 27,300 29,500 32,400
17,800 20,250 21,600 22,800 23,700 24,700 26,300 27,400 30,400 33,500

Step 2: Determine the classes.

• Find the highest and lowest value.

Highest Value (HV) = 33,500 and Lowest Value (LV) = 14,000

• Find the range.

Range = Highest Value (HV) − Lowest Value (LV) = 33,500 − 14,000 = 19,500

• Determine the number of classes.

The objective is to use just enough classes. We can determine the number of classes ($k$)
using the "2 to the $k$ rule". This rule selects the smallest number $k$ of classes
such that $2^k$ (2 raised to the power of $k$) is greater than or equal to the number of
observations ($n$).
Using our example, there are 80 young professionals (or $n = 80$). If we try $k = 6$,
which means we would use 6 classes, then $2^k = 2^6 = 64$, which is less than 80. Thus, 6 is not
enough classes. If we try $k = 7$, then $2^k = 2^7 = 128$, which is greater than 80. Therefore, the
recommended number of classes is 7.

• Determine the class interval (or width).

Generally, the class interval (or width) should be equal for all classes. The classes must
cover all the values in the raw data (that is, from lowest to highest). The class interval is
obtained using the formula:

$$\text{Suggested Class Interval} = \frac{\text{Range}}{\text{Number of Classes}} = \frac{HV - LV}{k} = \frac{19{,}500}{7} = 2{,}785.7 \approx 2{,}800$$

Note: Round the value of the interval up if there is a remainder. Here 2,785.7 is
rounded up to the convenient value 2,800.

• Select a starting point for the lowest class limit.

The starting point can be the smallest data value or any convenient number less than the
smallest data value. In our case 14,000 is used.

• Set the individual class limits.

Add the interval (or width) to the starting point to obtain the lower limit of the next
class. Keep adding until there are 7 lower limits:
14,000; 16,800; 19,600; 22,400; 25,200; 28,000; and 30,800.

To obtain the upper class limits, add the interval to the lower limit of each class. For the
first class, 14,000 + 2,800 = 16,800. In the table below, "14,000 < 16,800" means from 14,000
up to, but not including, 16,800.

Class Limits
14,000 < 16,800
16,800 < 19,600
19,600 < 22,400
22,400 < 25,200
25,200 < 28,000
28,000 < 30,800
30,800 < 33,600

Step 3: Tally the raw data.


Class Limits Tally
14,000 < 16,800 IIII
16,800 < 19,600 IIII-IIII
19,600 < 22,400 IIII-IIII-IIII-I
22,400 < 25,200 IIII-IIII-IIII-IIII-III
25,200 < 28,000 IIII-IIII-IIII-II
28,000 < 30,800 IIII-III
30,800 < 33,600 III
Step 4: Convert the tallied data into numerical frequencies.

Class Limits Tally Frequency


14,000 < 16,800 IIII 4
16,800 < 19,600 IIII-IIII 9
19,600 < 22,400 IIII-IIII-IIII-I 16
22,400 < 25,200 IIII-IIII-IIII-IIII-III 23
25,200 < 28,000 IIII-IIII-IIII-II 17
28,000 < 30,800 IIII-III 8
30,800 < 33,600 III 3

Step 5: Determine the relative frequency. It can be found by dividing each frequency by
the total frequency.

Class Limits Frequency Relative Frequency Found by


14,000 < 16,800 4 0.05 4 ÷ 80
16,800 < 19,600 9 0.11 9 ÷ 80
19,600 < 22,400 16 0.20 16 ÷ 80
22,400 < 25,200 23 0.29 23 ÷ 80
25,200 < 28,000 17 0.21 17 ÷ 80
28,000 < 30,800 8 0.10 8 ÷ 80
30,800 < 33,600 3 0.04 3 ÷ 80

Step 6: Determine the percentage. It can be found by multiplying each
relative frequency by 100%.

Class Limits Frequency Percentage Found by


14,000 < 16,800 4 5 (4 ÷ 80 )𝑥 100
16,800 < 19,600 9 11 (9 ÷ 80) 𝑥 100
19,600 < 22,400 16 20 (16 ÷ 80) 𝑥 100
22,400 < 25,200 23 29 (23 ÷ 80) 𝑥 100
25,200 < 28,000 17 21 (17 ÷ 80) 𝑥 100
28,000 < 30,800 8 10 (8 ÷ 80) 𝑥 100
30,800 < 33,600 3 4 (3 ÷ 80 )𝑥 100

Step 7: Determine the cumulative frequencies. The cumulative frequency can be found
by adding the frequency in each class to the total frequencies of the classes
preceding that class.

Class Limits       Frequency   Cumulative Frequency   Found by
14,000 < 16,800    4           4                      4
16,800 < 19,600    9           13                     4 + 9
19,600 < 22,400    16          29                     4 + 9 + 16
22,400 < 25,200    23          52                     4 + 9 + 16 + 23
25,200 < 28,000    17          69                     4 + 9 + 16 + 23 + 17
28,000 < 30,800    8           77                     4 + 9 + 16 + 23 + 17 + 8
30,800 < 33,600    3           80                     4 + 9 + 16 + 23 + 17 + 8 + 3

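Steps 3 to 7 can be checked with a short Python sketch. The function name `grouped_distribution` is hypothetical; the sketch assumes equal-width classes in which each class includes its lower limit and excludes its upper limit, and it uses the raw salaries of Example 2.

```python
from itertools import accumulate

# Raw monthly salaries of the 80 young professionals in Example 2.
salaries = [
    17400, 32400, 20200, 21300, 26200, 22750, 24600, 27300, 23500, 29500,
    14000, 30500, 17950, 20250, 24750, 21750, 23700, 26500, 22900, 27500,
    15500, 30700, 18400, 20400, 25000, 21900, 23850, 26800, 23000, 27800,
    17300, 32100, 20000, 21000, 26100, 22600, 24500, 27000, 23400, 29300,
    15700, 30700, 18700, 20500, 25150, 21900, 24100, 26900, 23200, 27900,
    14300, 30650, 18350, 20300, 25000, 21800, 23700, 26500, 22900, 27600,
    17000, 30750, 18800, 20800, 26000, 22000, 24300, 27000, 23400, 27900,
    17800, 33500, 20250, 21600, 26300, 22800, 24700, 27400, 23700, 30400,
]


def grouped_distribution(data, start, width, k):
    """Return (frequencies, relative frequencies, cumulative frequencies)."""
    freqs = [0] * k
    for x in data:
        index = (x - start) // width   # class whose range contains x
        if not 0 <= index < k:
            raise ValueError(f"{x} falls outside the classes")
        freqs[index] += 1
    n = len(data)
    relative = [f / n for f in freqs]  # Step 5: f divided by total
    cumulative = list(accumulate(freqs))  # Step 7: running total
    return freqs, relative, cumulative


freqs, relative, cumulative = grouped_distribution(salaries, 14_000, 2_800, 7)
print(freqs)
print(cumulative)
```

The frequencies and cumulative frequencies printed should match the tables in Steps 4 and 7.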
Step 8: Determine the midpoints. The midpoint can be found by getting the
average of the upper limit and lower limit in each class.

Class Limits Frequency Midpoints Found by


14,000 < 16,800 4 15 (14 + 16) ÷ 2
16,800 < 19,600 9 18 (17 + 19) ÷ 2
19,600 < 22,400 16 21 (20 + 22) ÷ 2
22,400 < 25,200 23 24 (23 + 25) ÷ 2
25,200 < 28,000 17 27 (26 + 28) ÷ 2
28,000 < 30,800 8 30 (29 + 31) ÷ 2
30,800 < 33,600 3 33 (32 + 34) ÷ 2

B. Graphing Statistical Data


When a data set contains a large number of values, drawing conclusions from an
ordered array or stem-and-leaf plot is often difficult. We need graphs or charts to
show numerical data visually. These include the histogram, the frequency polygon, and
the cumulative frequency polygon (ogive).
Histogram: A histogram is a graph in which the classes are marked on the
horizontal axis (x-axis) and the class frequencies on the vertical axis (y-axis). The height
of the bars represents the class frequencies, and the bars are drawn adjacent to each
other. Nevertheless, the histogram focuses on the frequency of each class and sacrifices
whatever information is contained in the actual observation.

Frequency Polygon: A frequency polygon is a graph that displays the data using
points which are connected by lines. The frequencies are represented by the heights of
the points at the midpoints of the classes. The vertical axis represents the frequency of
the distribution while the horizontal axis represents the midpoints of the frequency
distribution.

Cumulative Frequency Polygon (Ogive): A cumulative frequency polygon or
ogive (read as oh’-jive) is a graph that displays the cumulative frequencies for the
classes in a frequency distribution. The vertical axis represents the cumulative frequency
of the distribution while the horizontal axis represents the upper class boundaries of the
frequency distribution.
Example 1: Shown below is the frequency distribution in Example 2.

Class Limits       Class Boundaries   Midpoints   Frequency   cf
14,000 < 16,800    13,500 − 16,500    15          4           4
16,800 < 19,600    16,500 − 19,500    18          9           13
19,600 < 22,400    19,500 − 22,500    21          16          29
22,400 < 25,200    22,500 − 25,500    24          23          52
25,200 < 28,000    25,500 − 28,500    27          17          69
28,000 < 30,800    28,500 − 31,500    30          8           77
30,800 < 33,600    31,500 − 34,500    33          3           80

Construct a histogram, frequency polygon, and cumulative frequency polygon.

Solution:

a. Constructing a Histogram

Step 1: Find the midpoints of each class.

Step 2: Draw and label the 𝑥 − 𝑎𝑥𝑖𝑠 and 𝑦 − 𝑎𝑥𝑖𝑠.

Step 3: Represent the frequency on the 𝑦 − 𝑎𝑥𝑖𝑠 and the midpoints on the 𝑥 − 𝑎𝑥𝑖𝑠.

Step 4: Use the frequency to represent the height and draw the vertical bars.

[Figure: Histogram for Young Professionals' Monthly Salary. Horizontal axis: Salary (in Thousands), marked at the midpoints 15, 18, 21, 24, 27, 30, 33; vertical axis: Frequency.]

b. Constructing a Frequency Polygon


Step 1: Find the midpoints of each class.
Step 2: Draw and label the 𝑥 − 𝑎𝑥𝑖𝑠 and 𝑦 − 𝑎𝑥𝑖𝑠.
Step 3: Represent the frequency on the 𝑦 − 𝑎𝑥𝑖𝑠 and the midpoints on the 𝑥 − 𝑎𝑥𝑖𝑠.
Step 4: Connect adjacent points with line segments. Draw a line back to the 𝑥 − 𝑎𝑥𝑖𝑠 at
the beginning and end of the graph.

[Figure: Frequency Polygon for Young Professionals' Monthly Salary. Horizontal axis: Salary (in Thousands), marked at the midpoints 15, 18, 21, 24, 27, 30, 33; vertical axis: Frequency.]

c. Constructing a Cumulative Frequency Polygon (ogive)


Step 1: Find the cumulative distribution of the data set.
Class Limits       Class Boundaries   Frequency   cf
14,000 < 16,800    13,500 − 16,500    4           4
16,800 < 19,600    16,500 − 19,500    9           13
19,600 < 22,400    19,500 − 22,500    16          29
22,400 < 25,200    22,500 − 25,500    23          52
25,200 < 28,000    25,500 − 28,500    17          69
28,000 < 30,800    28,500 − 31,500    8           77
30,800 < 33,600    31,500 − 34,500    3           80

Step 2: Draw and label the 𝑥 − 𝑎𝑥𝑖𝑠 and 𝑦 − 𝑎𝑥𝑖𝑠.


Step 3: Represent the cumulative frequency on the y-axis and the upper class boundaries on
the x-axis.
Step 4: Connect the adjacent points with line segments.
[Figure: Ogive for Young Professionals' Monthly Salary. Horizontal axis: upper class boundaries in thousands (16.5, 19.5, 22.5, 25.5, 28.5, 31.5, 34.5); vertical axis: Cumulative Frequency.]

CARRY OUT

Practice makes perfect!!
In the table below is selected data for 20 college students.

Number Gender Major Distance to School (km) Number of Siblings


1 F BSED 5 3
2 F BSBA 9 5
3 M BSED 11 6
4 F AB 14 3
5 M BIT 3 3
6 F BSED 5 6
7 M AB 20 3
8 F AB 7 1
9 F BSBA 9 6
10 F AB 6 3
11 M BIT 10 4
12 F BSBA 10 2
13 M BSED 5 1
14 F BSBA 9 3
15 M EE 10 1
16 F AB 3 2
17 F BEED 15 4
18 M AB 6 5
19 F EE 30 3
20 M AB 8 5

Using the collected data in the table above


1. Prepare a frequency distribution for gender and major.

2. Construct a histogram, cumulative frequency polygon, and frequency polygon.

CHECKPOINT
I. Direction. Use the given data below to answer what is being asked.
1. A marketing research consultant conducted a survey of 40 persons who
visited fast food chains one morning. The ages of the persons were recorded
to the nearest year as follows.
16 29 44 36 40 24 28 47 34 46 35 26
50 33 38 19 22 53 44 55 32 21 44 41
19 40 30 47 47 27 50 33 46 48 29 27
32 31 42 28
Prepare a frequency distribution table using Rule 1 and Rule 2.
2. The daily numbers of machine copies made by Estaya copying center are
grouped into a table having the classes 0−49, 50−99, 100−149, and 150−199.
Find
a) the class boundaries
b) the class midpoints
c) the class interval

3. SJS Travel Agency, a nationwide local travel agency, offers special rates during
the summer period. The owner wants additional information on the ages of
the people taking travel tours. Construct a histogram, frequency
polygon, and cumulative frequency polygon (ogive) from the table below. What
conclusions can you reach based on the information presented?

Class Limits   Class Boundaries   Frequency   cf
18 < 26        17.5 − 26.5        3           3
27 < 35        26.5 − 35.5        5           8
36 < 44        35.5 − 44.5        9           17
45 < 53        44.5 − 53.5        14          31
54 < 62        53.5 − 62.5        11          42
63 < 71        62.5 − 71.5        6           48
72 < 80        71.5 − 80.5        2           50

CONTEMPLATE ON IT
List down the concepts that you have learned from this lesson.

Related Readings:

https://www.statisticshowto.com/probability-and-statistics/descriptive-
statistics/frequency-distribution-

https://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/


Lesson 2 Measures of Central Tendency

Learning Outcomes:
At the end of this lesson, you must have:
1. Discussed the properties of the three different measures of central tendency: the
mean, median and the mode;
2. Computed the measures of central tendency;
3. Manifested appreciation of the applications of the measures of central tendency in
daily life situation.

CATCH IT
The following are the scores of ten students in a 20-item math quiz:

Student Number   Score in Math Quiz
1                15
2                14
3                17
4                18
5                15
6                19
7                20
8                11
9                16
10               13

a. What do you think is the average score?


b. Suppose you arrange the scores from lowest to highest, what do you think
would be the middle-most score?
c. Look at the scores. What scores appear most frequently?

CONCEPTUALIZE
a. How did you compute the average score? What greatly affects the mean?
b. What do we call the middle most score? What affects this value?
c. How many scores appeared frequently? What are they? What do we call
them? Would it be possible that you can easily determine the mode?

I. MEAN
The arithmetic mean, often simply called the mean, is the most frequently used
measure of central tendency. The mean is the only common measure in which all values
play an equal role; that is, determining its value requires every value in the data set.
The mean is appropriate for determining the central tendency
of interval or ratio data.

The symbol $\bar{x}$ (read as "x bar") is used to represent the mean of a sample, and the
symbol $\mu$ is used to denote the mean of a population.

A. Properties of the Mean


1. A set of data has only one mean.

2. Mean can be applied for interval and ratio data.
3. All values in the data set are included in computing the mean.
4. The mean is very useful in comparing two or more data sets.
5. Mean is affected by extremely small or large values in a data set.
6. Mean is the most appropriate in symmetrical data.

The formula for the mean:

$$\text{Mean} = \frac{\text{sum of all values}}{\text{number of values}}$$

B. Mean for Ungrouped Data

Sample Mean: $\bar{x} = \dfrac{\sum x}{n}$
Where: $\bar{x}$ = sample mean
$x$ = the value of any particular observation or measurement
$\sum x$ = sum of all $x$'s
$n$ = total number of values in the sample

Population Mean: $\mu = \dfrac{\sum x}{N}$
Where: $\mu$ = population mean
$x$ = the value of any particular observation or measurement
$N$ = total number of values in the population

Example 1: The daily rates of a sample of eight employees at GMS Inc. are
₱550, ₱420, ₱560, ₱500, ₱700, ₱670, ₱860, and ₱480. Find the mean daily rate of
the employees.

Solution:
$$\bar{x} = \frac{\sum x}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8}{n}$$
$$\bar{x} = \frac{550 + 420 + 560 + 500 + 700 + 670 + 860 + 480}{8} = \frac{4{,}740}{8} = 592.50$$

The sample mean daily rate of the employees is ₱592.50.

Example 2: Find the population mean of the ages of 9 middle-management
employees of a certain company. The ages are 53, 45, 59, 48, 54, 46, 51, 58, and
55.
Solution:
$$\mu = \frac{\sum x}{N} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9}{N}$$
$$\mu = \frac{53 + 45 + 59 + 48 + 54 + 46 + 51 + 58 + 55}{9} = \frac{469}{9} = 52.11$$
The population mean age of the middle-management employees is 52.11 years.
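Both examples can be verified with a minimal Python sketch. The helper `mean` is a hypothetical name; the same arithmetic applies to a sample and to a population, and only the symbol differs.

```python
def mean(values):
    """Arithmetic mean: sum of all values divided by the number of values."""
    return sum(values) / len(values)


daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]   # Example 1 (sample)
ages = [53, 45, 59, 48, 54, 46, 51, 58, 55]              # Example 2 (population)

print(mean(daily_rates))
print(round(mean(ages), 2))
```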

C. Mean for Grouped Data

Sample Mean: $\bar{x} = \dfrac{\sum fx}{n}$
Where: $\bar{x}$ = sample mean
$f$ = frequency
$x$ = midpoint of the class
$\sum fx$ = sum of all the products of $f$ and $x$
$n$ = total number of values in the sample

Population Mean: $\mu = \dfrac{\sum fx}{N}$
Where: $\mu$ = population mean
$f$ = frequency
$x$ = midpoint of the class
$\sum fx$ = sum of all the products of $f$ and $x$
$N$ = total number of values in the population

Example 3: Determine the mean of the frequency distribution of the ages of 50
people taking travel tours, given the table below.

Class Limits   Frequency (f)
18-26          3
27-35          5
36-44          9
45-53          14
54-62          11
63-71          6
72-80          2

Solution:
Step 1. Determine the midpoint of each class.
Step 2. Multiply each class frequency (f) by the corresponding midpoint
(x) to obtain the product fx.
Step 3. Get the sum of the products fx.
Step 4. Apply the formula to obtain the value of the sample mean.

Class Limits   Frequency (f)   Midpoint (x)   fx
18-26          3               22             66
27-35          5               31             155
36-44          9               40             360
45-53          14              49             686
54-62          11              58             638
63-71          6               67             402
72-80          2               76             152
Total          50                             ∑fx = 2,459

$$\bar{x} = \frac{\sum fx}{n} = \frac{2{,}459}{50} = 49.18$$

Thus, the mean age of the people taking travel tours is 49.18 years.
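The four steps of Example 3 can be sketched in Python as follows. The function name `grouped_mean` and the tuple layout (lower limit, upper limit, frequency) are illustrative choices, not part of the module.

```python
# (lower limit, upper limit, frequency) for each class in Example 3.
classes = [
    (18, 26, 3), (27, 35, 5), (36, 44, 9), (45, 53, 14),
    (54, 62, 11), (63, 71, 6), (72, 80, 2),
]


def grouped_mean(classes):
    """Mean of grouped data: sum of f * midpoint divided by total frequency."""
    total_fx = sum(f * (low + high) / 2 for low, high, f in classes)
    n = sum(f for _, _, f in classes)
    return total_fx / n


print(grouped_mean(classes))
```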

Weighted Mean, Geometric Mean, and Combined Mean

A. Weighted Mean
The weighted mean is particularly useful when various classes or
groups contribute differently to the total. The weighted mean is found
by multiplying each value by its corresponding weight and dividing the
sum of the products by the sum of the weights.

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$$

Where: $\bar{x}_w$ = weighted mean
$w_i$ = corresponding weight
$x_i$ = the value of any particular observation or measurement

Example 1: At the Mathematics Department of a State College there are 18
instructors, 12 assistant professors, 7 associate professors, and 3 professors.
Their monthly salaries are ₱30,500, ₱33,700, ₱38,600, and ₱45,000, respectively. What is the
weighted mean salary?

Solution:
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$$
$$\bar{x}_w = \frac{(18)(30{,}500) + (12)(33{,}700) + (7)(38{,}600) + (3)(45{,}000)}{18 + 12 + 7 + 3} = \frac{1{,}358{,}600}{40} = 33{,}965$$

The weighted mean salary is ₱33,965.
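A brief Python sketch of the same computation, with the number of faculty in each rank used as the weight. The function name `weighted_mean` is hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


salaries = [30_500, 33_700, 38_600, 45_000]   # monthly salary per rank
headcounts = [18, 12, 7, 3]                   # weights: faculty per rank

print(weighted_mean(salaries, headcounts))
```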

B. Geometric Mean

The geometric mean of a set of $n$ positive numbers is defined as the
$n$th root of the product of the $n$ numbers. There are two main applications of the
geometric mean: the first is to average percents, indexes, and relatives; the
second is to establish the average percent increase in production, sales, or other
business transactions or economic series from one period of time to another.

Formulas:

(1) $GM = \sqrt[n]{(x_1)(x_2)(x_3) \cdots (x_n)}$

(2) $GM = \sqrt[n-1]{\dfrac{\text{value at the end of the period}}{\text{value at the start of the period}}} - 1$

Where: $GM$ = geometric mean
$x_i$ = value of any particular observation or measurement
$n$ = number of observations

Example 2:

Suppose the profits earned by the MSS Construction Company on five projects
were 5, 6, 4, 8, and 10 percent, respectively. What is the geometric mean profit?
Solution:
$x_1 = 5$, $x_2 = 6$, $x_3 = 4$, $x_4 = 8$, $x_5 = 10$, $n = 5$

$$GM = \sqrt[n]{(x_1)(x_2)(x_3)(x_4)(x_5)}$$
$$GM = \sqrt[5]{(5)(6)(4)(8)(10)} = \sqrt[5]{9{,}600} = 6.26$$

The geometric mean profit is 6.26 percent.

Example 3:

Badminton as a sport grew rapidly in 2015. From January to December 2015 the
number of badminton clubs in Metro Manila increased from 20 to 155. Compute the
mean monthly percent increase in the number of badminton clubs.

Solution:

Note that 12 months are involved. However, there are only 11 monthly rates of change.
That is, we compute the changes from January to February, from February to March,
March to April, April to May, and so forth. So $n$ is 12 and there are $n - 1 = 11$ monthly
percent increases.

$$GM = \sqrt[n-1]{\frac{\text{value at the end of the period}}{\text{value at the start of the period}}} - 1$$

$$GM = \sqrt[12-1]{\frac{155}{20}} - 1 = \sqrt[11]{7.75} - 1 = 0.2046$$

Hence, the number of badminton clubs increased at a rate of about 0.2046, or 20.46%, per
month.
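Both geometric mean formulas can be sketched in Python. The function names are hypothetical, and `math.prod` requires Python 3.8 or later; roots are taken by raising to a fractional power.

```python
import math


def geometric_mean(values):
    """Formula (1): n-th root of the product of n positive numbers."""
    return math.prod(values) ** (1 / len(values))


def mean_rate_of_increase(start, end, periods):
    """Formula (2): (end / start) ** (1 / periods) - 1.

    `periods` is the number of rate changes, that is, n - 1.
    """
    return (end / start) ** (1 / periods) - 1


profit = geometric_mean([5, 6, 4, 8, 10])            # Example 2
growth = mean_rate_of_increase(20, 155, 12 - 1)      # Example 3

print(round(profit, 2))
print(round(growth, 4))
```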

II. MEDIAN
The median is the midpoint of the data array. When the data set is
ordered, whether ascending or descending, it is called a data array. Median is
an appropriate measure of central tendency for data that are ordinal or above,
but it is more valuable in an ordinal type of data.

A. Properties of Median
1. The median is unique, there is only one median for a set of data.
2. The median is found by arranging the set of data from lowest to highest (highest
to lowest) and getting the value of the middle observation.
3. Median is affected by the number of values.
4. Median can be applied for ordinal, interval and ratio data
5. Median is most appropriate in a skewed data.

B. Median for Ungrouped Data


To determine the value of the median for ungrouped data, we need to consider two
rules:
1. If $n$ is odd, the median is the middle-ranked value.
2. If $n$ is even, the median is the average of the two middle-ranked values.

$$\text{Median (Rank Value)} = \frac{n + 1}{2}$$

Note: $n$ is the sample size. For a population, we use $N$.

Example 1: Find the median of the ages of 9 middle-management employees of
a certain company. The ages are 53, 45, 59, 48, 54, 46, 51, 58, and 55.
Solution:
Step 1: Arrange the data in order.
45, 46, 48, 51, 53, 54, 55, 58, 59

Step 2: Select the middle rank value using the formula:

$$\text{Median (Rank Value)} = \frac{n + 1}{2} = \frac{9 + 1}{2} = \frac{10}{2} = 5$$

Step 3: Identify the median in the data set.

45, 46, 48, 51, 53, 54, 55, 58, 59

The 5th value is 53. Hence, the median age is 53 years.

Example 2: The daily rates of a sample of eight employees at GMS Inc. are
₱550, ₱420, ₱560, ₱500, ₱700, ₱670, ₱860, and ₱480. Find the median daily rate of
the employees.

Solution:
Step 1: Arrange the data in order.

₱420, ₱480, ₱500, ₱550, ₱560, ₱670, ₱700, ₱860

Step 2: Select the middle rank value using the formula:

$$\text{Median (Rank Value)} = \frac{n + 1}{2} = \frac{8 + 1}{2} = \frac{9}{2} = 4.5$$

Step 3: Identify the median in the data set.

₱420, ₱480, ₱500, ₱550, ₱560, ₱670, ₱700, ₱860

Rank 4.5 falls between the 4th and 5th values. Since the middle point falls between
₱550 and ₱560, we can determine the median of the data set by getting the average of
the two values.

$$\text{Median} = \frac{550 + 560}{2} = \frac{1{,}110}{2} = 555$$

Therefore, the median daily rate is ₱555.
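The two rules for the median can be sketched as a single Python function. The name `median` is a local helper written for illustration; list positions in Python start at 0, so rank (n + 1) / 2 corresponds to index n // 2 when n is odd.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)          # Step 1: arrange the data in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: the single middle-ranked value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average of the two


ages = [53, 45, 59, 48, 54, 46, 51, 58, 55]               # Example 1
daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]    # Example 2

print(median(ages))
print(median(daily_rates))
```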

III. Mode
The mode is the value in a data set that appears most frequently. Like the
median, and unlike the mean, the mode is not affected by extreme values in a data
set. A data set may not contain any mode if none of the values is "most typical".

A data set that has only one value occurring with the greatest frequency is
called unimodal. If the data set has two values with the same greatest
frequency, both values are considered modes and the data set is bimodal. If
the data set has more than two modes, it is said to be
multimodal. In some cases every value in a data set occurs with the same
frequency; when this happens, the data set is said to have no mode.

A. Properties of Mode

1. The mode is found by locating the most frequently occurring value.
2. The mode is the easiest average to compute
3. There can be more than one mode or even no mode in any given data set.
4. Mode is not affected by the extreme small or large values.
5. Mode can be applied for nominal, ordinal, interval and ratio data.

Example 1: The following data represent the total unit sales for smart phones from
a sample of 10 communication centers for the month of August:
15, 17, 10, 12, 13, 10, 14, 10, 8, and 9. Find the mode.

Solution: The ordered array for these data is 8, 9, 10, 10, 10, 12, 13, 14, 15, 17. Since 10
appears three times, more often than any other value, the mode is 10.

Example 2: An operations manager in charge of a company's manufacturing keeps
track of the number of LED televisions manufactured in a day. Find the mode of the
following data: 20, 18, 19, 25, 20, 21, 20, 25, 30, 29, 28, 29, 25, 25, 27, 26, 22, and 20.

Solution: The ordered array for these data is

18, 19, 20, 20, 20, 20, 21, 22, 25, 25, 25, 25, 26, 27, 28, 29, 29, 30

There are two modes, 20 and 25, since each of these values occurs four times.
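A minimal Python sketch for finding the mode or modes of a data set. The function name `modes` is hypothetical. Following the convention in this lesson, it returns an empty list (no mode) when every value occurs equally often.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes; an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # Every distinct value occurs equally often: the data set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


phone_sales = [15, 17, 10, 12, 13, 10, 14, 10, 8, 9]          # Example 1
tv_output = [20, 18, 19, 25, 20, 21, 20, 25, 30,
             29, 28, 29, 25, 25, 27, 26, 22, 20]              # Example 2

print(modes(phone_sales))
print(modes(tv_output))
```

A result with one entry is unimodal, two entries bimodal, and more than two multimodal.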

CARRY OUT

Solve: A salesperson records the following daily expenditures during a ten-day trip:
₱233.04 ₱198.75 ₱166.85 ₱343.60 ₱201.50
₱527.92 ₱455.36 ₱354.72 ₱198.75 ₱769.25
1. What is the mean of the given data?
2. What is the median of the given data?
3. Which value occurs most frequently?
4. Suppose the salesperson had another two-day trip with a total expenditure of
₱345.00. What will be the new mean expenditure?
5. What will be the new median?

CHECKPOINT

Direction: Compute the mean, median, and mode of the following problems.

1. A college professor administered a unit exam to one of his classes and found that the majority
of the items were too easy. The scores are:
45, 39, 40, 48, 35, 37, 36, 37, 40, 44, 41, 49, 29, 8, 32, 36, 37, 41, 40, 36, 39, 30, 25, 43, and 50.

2. The Department of Agriculture conducted a survey of farmers in Guimaras. The following is
the list of acres farmed by a sample of 20 farming families throughout the province:

200, 1 200, 300, 350, 500, 550, 1 500, 400, 500, 800, 850, 1 300, 2 000, 2 100, 340, 760, 830, 2 670, 990,
and 3 000.

3. A sales manager would like to see each of his sales representatives’ unit sales per month. A
new recruit is told to keep a weekly record of the sales. The following are the data from the
previous month: 10, 14, 17, 18, 30, 28, 27, 17, 21, 23, 43, 21, 26, 28, 16, 6, 15, and 10.

CONTEMPLATE ON IT
List down the concepts that you have learned from this lesson.

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
___________________________________________________________________


Lesson 3 Measures of Dispersion

Learning Outcomes:
At the end of this lesson, you must have:

a) Computed the measures of variability such as range, quartile deviation, mean absolute
deviation, variance, and standard deviation of a given data set.
b) Discussed the properties of the different measures of variability.
c) Appreciated the application of the range and standard deviation in the analysis of data.

CATCH IT

1. What do you mean when you say dispersion?
2. How do you know whether the given data are dispersed or close to each other?
3. How do you know whether the given data are heterogeneous or homogeneous?

CONCEPTUALIZE
Another important characteristic of a data set is how it is distributed, or how far each
element is from some measure of central tendency.

A. Range

The range is the simplest and easiest measure of dispersion to determine. It is the
difference between the highest value and the lowest value in the data set. The range has two
advantages: (i) it is easy to compute and (ii) it is easy to understand.
On the other hand, it also has two disadvantages: (i) it can be distorted by a single extreme
value (or outlier) and (ii) only two values are used in the calculation.
Properties of Range
1. It is a quick but rough measure of dispersion.
2. The larger the value of the range, the more dispersed the observations are.
3. It considers only the lowest and the highest values in the data set.

Example 1: The daily rates of a sample of eight employees at GMS Inc. are
₱550, ₱420, ₱560, ₱500, ₱700, ₱670, ₱860, ₱480. Find the range.
Solution:
Step 1: Determine the highest value and the lowest value in the data set.
Highest Value (HV) = ₱860     Lowest Value (LV) = ₱420
Step 2: Solve for the range.
Range = HV − LV = ₱860 − ₱420 = ₱440
The range of the daily rates is ₱440.
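
As a quick check of Example 1, the range can be computed in a few lines of Python. This is a minimal sketch; the function name `data_range` and the extra rate of 5,000 are assumptions introduced only to illustrate how one outlier distorts the range.

```python
# Daily rates (in pesos) from Example 1.
daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]

def data_range(values):
    """Range = highest value - lowest value."""
    return max(values) - min(values)

print(data_range(daily_rates))             # 440

# A single extreme value is enough to distort the range.
# The rate of 5,000 is hypothetical and not part of the example data.
print(data_range(daily_rates + [5000]))    # 4580
```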

B. Variance and Standard Deviation

One of the most widely used measures of dispersion is the standard deviation.
The more spread apart the data, the higher the standard deviation. Standard deviation is calculated as
the square root of the variance. In finance, standard deviation is applied to the annual rate of
return of an investment to measure the investment’s volatility. Standard deviation is also known
as historical volatility and is used by investors as a gauge for the amount of expected volatility.

Variance is a measure of the dispersion of a set of data points around their mean value.
It is the mathematical expectation of the squared deviations from the mean.

Properties of Variance

1. The variance is always non-negative.
2. The variance is easy to manipulate for further mathematical treatment.
3. The variance makes use of all observations.
4. The unit of measurement for the variance is the square of the unit of measure of the
given set of values. Thus, for example, if a data set has an inch as the unit of measure,
the unit of its variance will be in squared inches.

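Sample Variance and Sample Standard Deviation for Ungrouped Data

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \qquad s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \qquad s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}}$$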

Where: $s^2$ = sample variance
$s$ = sample standard deviation
$x$ = the value of any particular observation or measurement
$\bar{x}$ = sample mean
$n$ = sample size (number of observations in the sample)

Example 2: The daily rates of a sample of eight employees at GMS Inc. are
Ᵽ550, Ᵽ420, Ᵽ560, Ᵽ500, Ᵽ700, Ᵽ670, Ᵽ860, Ᵽ480. Find the variance and standard deviation.

Solution:

Step 1: Compute the mean of the data set.

$$\bar{x} = \frac{\sum x}{n} = \frac{550 + 420 + 560 + 500 + 700 + 670 + 860 + 480}{8} = \frac{4{,}740}{8} = 592.50$$

Step 2: Subtract the mean from each value in the data set.

𝑥 𝑥 − 𝑥̅
550 −42.5
420 −172.5
560 −32.5
500 −92.5
700 107.5
670 77.5
860 267.5
480 −112.5
∑ 𝑥 = 4 740 ∑(𝑥 − 𝑥̅ ) = 0
Step 3: Square each deviation x − x̄, then get the sum. For example, (−42.5)² = 1,806.25.

x        x − x̄       (x − x̄)²
550      −42.5       1,806.25
420      −172.5      29,756.25
560      −32.5       1,056.25
500      −92.5       8,556.25
700      107.5       11,556.25
670      77.5        6,006.25
860      267.5       71,556.25
480      −112.5      12,656.25
∑x = 4,740     ∑(x − x̄) = 0     ∑(x − x̄)² = 142,950

Step 4: Solve for the variance and the standard deviation. We can obtain the standard deviation
by simply extracting the square root of the variance.

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{142{,}950}{8 - 1} = 20{,}421.43$$

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{142{,}950}{8 - 1}} = \sqrt{20{,}421.43} = 142.90$$

Hence, the variance is 20,421.43 (in squared pesos) and the standard deviation is ₱142.90.
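
The four steps of this solution can be reproduced in Python as a rough check. The sketch below follows the deviation formula step by step; the variable names are assumptions chosen for readability, and the last two lines compare the result with `variance` and `stdev` from Python's standard `statistics` module, which also divide by n − 1.

```python
from statistics import mean, stdev, variance

# Daily rates (in pesos) of the eight sampled employees.
daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]

n = len(daily_rates)
x_bar = mean(daily_rates)                            # Step 1: sample mean
deviations = [x - x_bar for x in daily_rates]        # Step 2: x - x_bar
sum_sq_dev = sum(d ** 2 for d in deviations)         # Step 3: sum of squares
sample_variance = sum_sq_dev / (n - 1)               # Step 4: divide by n - 1
sample_sd = sample_variance ** 0.5

print(x_bar)                       # 592.5
print(sum_sq_dev)                  # 142950.0
print(round(sample_variance, 2))   # 20421.43
print(round(sample_sd, 2))         # 142.9

# Cross-check against the standard library.
print(abs(variance(daily_rates) - sample_variance) < 1e-9)  # True
print(abs(stdev(daily_rates) - sample_sd) < 1e-9)           # True
```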

Alternative Solution: An alternative solution can be done using the formula:

$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \qquad s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}}$$

Step 1: Get the sum of the data set.

𝑥
550
420
560
500
700
670
860
480
∑ 𝑥 = 4,740
Step 2: Square the values in the data set and get the sum.

𝑥 𝑥2
550 302,500
420 176,400
560 313,600
500 250,000
700 490,000
670 448,900
860 739,600
480 230,400
∑ 𝑥 = 4,740 ∑ 𝑥 2 = 2,951,400
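Step 3: Solve for the values of the variance and standard deviation.

$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{2{,}951{,}400 - \frac{(4{,}740)^2}{8}}{8 - 1} = \frac{2{,}951{,}400 - 2{,}808{,}450}{7} = 20{,}421.43$$

$$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} = \sqrt{\frac{2{,}951{,}400 - \frac{(4{,}740)^2}{8}}{8 - 1}} = \sqrt{\frac{2{,}951{,}400 - 2{,}808{,}450}{7}} = \sqrt{20{,}421.43} = 142.90$$

Thus, the variance is 20,421.43 (in squared pesos) and the standard deviation is ₱142.90.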
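
The alternative formula needs only two running totals, the sum of the values and the sum of their squares. A minimal Python sketch under that reading is shown below; the variable names are assumptions.

```python
from math import sqrt

daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]

n = len(daily_rates)
sum_x = sum(daily_rates)                      # sum of x   = 4,740
sum_x2 = sum(x * x for x in daily_rates)      # sum of x^2 = 2,951,400

# Shortcut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)
shortcut_variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
shortcut_sd = sqrt(shortcut_variance)

print(sum_x, sum_x2)                  # 4740 2951400
print(round(shortcut_variance, 2))    # 20421.43
print(round(shortcut_sd, 2))          # 142.9
```

Both formulas give the same answer here. The shortcut form is convenient for hand computation, since no deviations have to be tabulated.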

Population Variance and Population Standard Deviation

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \qquad \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

Where: $\sigma^2$ = population variance
$\sigma$ = population standard deviation
$x$ = the value of any particular observation or measurement
$\mu$ = population mean
$N$ = population size (number of observations in the population)

Example 3: The monthly incomes of the five research directors of Recoletos schools are:
₱55,000, ₱59,500, ₱62,500, ₱57,000, ₱61,000. Find the variance and standard deviation.

Solution:

Step 1: Compute the mean of the data set.

$$\mu = \frac{\sum x}{N} = \frac{55{,}000 + 59{,}500 + 62{,}500 + 57{,}000 + 61{,}000}{5} = \frac{295{,}000}{5} = 59{,}000$$

Step 2: Subtract the population mean from each value in the data set.

𝑥 𝑥−𝜇
55,000 −4,000
59,500 500
62,500 3,500
57,000 −2,000
61,000 2,000

Step 3: Get the square of 𝑥 − 𝜇, then get the sum.

𝑥 𝑥−𝜇 (𝑥 − 𝜇)2
55,000 −4,000 16,000,000
59,500 500 250,000
62,500 3,500 12,250,000
57,000 −2,000 4,000,000
61,000 2,000 4,000,000
∑ 𝑥 = 295,000 ∑(𝑥 − 𝜇) = 0 ∑(𝑥 − 𝜇)2 = 36,500,000

Step 4: Solve for the population variance and population standard deviation.

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{36{,}500{,}000}{5} = 7{,}300{,}000$$

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{7{,}300{,}000} = 2{,}701.85$$

Hence, the population variance is 7,300,000 (in squared pesos) and the population standard deviation is ₱2,701.85.
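
Because the five directors are the whole population, the divisor is N rather than n − 1. The sketch below uses `pvariance` and `pstdev` from Python's standard `statistics` module, which divide by N, and contrasts them with the sample version for comparison only. Note that 36,500,000 ÷ 5 is 7,300,000, and its square root is about 2,701.85.

```python
from statistics import pstdev, pvariance, variance

# Monthly incomes (in pesos) of the five research directors.
monthly_incomes = [55_000, 59_500, 62_500, 57_000, 61_000]

population_variance = pvariance(monthly_incomes)   # divides by N
population_sd = pstdev(monthly_incomes)

print(population_variance)          # 7300000
print(round(population_sd, 2))      # 2701.85

# For contrast only: treating the same five values as a sample
# divides by n - 1 and gives a larger figure.
print(variance(monthly_incomes))    # 9125000
```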

CARRY OUT

1. A time-study analyst observed a packaging operation and collected the following
times (in seconds) required for the operation to fill packages of a fixed volume box:
11, 12, 15, 18, 13, 18, 16, 14, 12, and 17. Find the range, variance, and standard deviation.

CHECKPOINT

Direction: Read, analyze, and compute the measures of dispersion of the following data.

1. The following data give the weight (in pounds) gained by 8 employees at the end of the
Christmas gatherings. Complete the table and compute the variance and standard
deviation.
x        x²
11
9
10
6
5
4
8
7

2. The supervisor of a fast-food restaurant selected several receipts at random. The
amounts spent by customers were ₱75, ₱60, ₱65, ₱62, ₱80, ₱83, ₱91, and ₱78. Complete
the table and find the variance and standard deviation.

𝑥 𝑥 − 𝑥̅ (𝑥 − 𝑥̅ )2
75
60
65
62
80
83
91
78

CONTEMPLATE ON IT
List down the concepts that you have learned from this lesson.

Lesson 4 Measures of Relative Position

Learning Outcomes:
At the end of this lesson, you must have:

a) determined the relative position of the given data
b) solved for the quartiles and z-score of the given data
c) illustrated the relative position of the given data using the box-and-whisker plot.

CATCH UP

1. Which of the following test scores is the highest relative score?


a. A score of 65 on a test with a mean of 72 and a standard deviation of 8.2
b. A score of 102 on a test with a mean of 130 and a standard deviation of 18.5
c. A score of 605 on a test with a mean of 720 and a standard deviation of 116.4
d. A score of 65 on a test with a mean of 72 and a standard deviation of 18.2
2. If I tell you that you scored at the 50th percentile on your final exam, you would
know:
a. 50% of the class scored as well as or worse than I did
b. I earned 50% on the exam
c. 50% of the class scored worse than I did
d. 50% of the class scored as well as I did

CONCEPTUALIZE

When presenting or analyzing a data set, it is sometimes helpful to group subjects into
several equal groups. To create four equal groups, we need the values that split the data such
that 25% of the observations are in each group. These cut-off points are called quartiles; the
general term for such cut-off points is quantiles. Deciles split the data into 10 parts, and
percentiles split the data into 100 parts.

Values such as quartiles can also be expressed as percentiles; for example, the lowest
quartile is also the 25th percentile, and the median is the 50th percentile or the 5th decile.

A. Quartiles

$$Q_k = \frac{k(N+1)}{4}$$

Where: $Q_k$ = position of the $k$th quartile in the ordered data
$N$ = number of observations
$k$ = quartile number (1, 2, or 3)
Example 1: Find the first, second, and third quartiles of the ages of 9 middle-management
employees of a certain company. The ages are 53, 45, 59, 48, 54, 46, 51, 58, and 55.

Solution:
Step 1: Arrange the data in order.
45, 46, 48, 51, 53, 54, 55, 58, 59

Step 2: Locate the positions of the first, second, and third quartiles using the formula $Q_k = \frac{k(N+1)}{4}$.

$$Q_1 = \frac{k(N+1)}{4} = \frac{1(9+1)}{4} = \frac{10}{4} = 2.5$$

$$Q_2 = \frac{k(N+1)}{4} = \frac{2(9+1)}{4} = \frac{20}{4} = 5$$

$$Q_3 = \frac{k(N+1)}{4} = \frac{3(9+1)}{4} = \frac{30}{4} = 7.5$$
Step 3: Identify the first, second, and third quartile values in the data set.
45, 46, 48, 51, 53, 54, 55, 58, 59

The 2.5th position falls between 46 and 48, the 5th position is 53, and the 7.5th position
falls between 55 and 58. We can determine the first and third quartiles of the data set by
getting the average of the two neighboring values.

$$Q_1 = \frac{46 + 48}{2} = \frac{94}{2} = 47 \qquad Q_3 = \frac{55 + 58}{2} = \frac{113}{2} = 56.5$$

Therefore, $Q_1 = 47$, $Q_2 = 53$, and $Q_3 = 56.5$.
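
The position rule used in this example can be sketched in Python as follows. The function name `quartile` is an assumption, and the sketch assumes at least three observations so that every position falls inside the data. A fractional position is resolved by linear interpolation, which at a position ending in .5 is the same as averaging the two neighboring values.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) from the position k(N + 1)/4."""
    ordered = sorted(values)
    position = k * (len(ordered) + 1) / 4    # 1-based position in the ordered data
    lower = int(position)                    # whole part of the position
    fraction = position - lower              # fractional part (0 or between 0 and 1)
    if fraction == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

ages = [53, 45, 59, 48, 54, 46, 51, 58, 55]
print([quartile(ages, k) for k in (1, 2, 3)])   # [47.0, 53, 56.5]
```

Other textbooks and software packages use slightly different position rules, so quartiles reported elsewhere for the same data may differ a little from these values.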

B. z-Score

The z-score is used to determine the position of one observation relative to the others in a set of
data. For example, we may want to know how a student’s score of 42 compares with
the scores of the other students in the class on a quiz with a total of 50 points. The mean
and the standard deviation of the scores can be used to compute the z-score, which
measures the relative standing of a measurement in a data set.

A z-score measures the distance between an observation and the mean, measured in
units of standard deviation. The following formulas show how to compute the z-score for a
data value x in a population and in a sample.

$$z = \frac{x - \mu}{\sigma} \ \text{(for population)} \qquad z = \frac{x - \bar{x}}{s} \ \text{(for sample)}$$

Example 1: The monthly expenditures of a large group of households are normally distributed
with a mean of ₱48,700 and a standard deviation of ₱10,400. What are the z-values of monthly
expenditures of ₱59,100 and ₱38,300?
Solution:
Let $\mu = 48{,}700$ and $\sigma = 10{,}400$.
Using the formula for z, the z-values for the two x values (₱59,100 and ₱38,300) are
computed as follows:

For $x = 59{,}100$: $$z = \frac{x - \mu}{\sigma} = \frac{59{,}100 - 48{,}700}{10{,}400} = 1.00$$

For $x = 38{,}300$: $$z = \frac{x - \mu}{\sigma} = \frac{38{,}300 - 48{,}700}{10{,}400} = -1.00$$

The z of 1.00 indicates that a monthly expenditure of ₱59,100 is one standard
deviation above the mean, and a z of −1.00 shows that a monthly expenditure of ₱38,300 is one
standard deviation below the mean. Note that both monthly expenditures
(₱59,100 and ₱38,300) are the same distance, ₱10,400, from the mean.

Example 2: A normal curve has a mean of 650 and a standard deviation of 40. An analyst is
interested in a value of 575 and wants to find its equivalent z-score.
Solution:
Given: $\bar{x} = 650$, $s = 40$, $x = 575$
Substitute the given values into the z-score formula:

$$z = \frac{x - \bar{x}}{s} = \frac{575 - 650}{40} = -1.875$$

Example 3: A time-study report indicates that an assembly-line task should be finished in an
average of 5.64 minutes, with a standard deviation of 0.97 minutes. One particular item had a
z-score of 1.53. What was the completion time of this item?

Solution:
Given: $\bar{x} = 5.64$, $s = 0.97$, $z = 1.53$
Substituting the given values to determine the x-value, we get

$$z = \frac{x - \bar{x}}{s} \quad \Rightarrow \quad x = \bar{x} + zs$$

$$x = \bar{x} + zs = 5.64 + (1.53)(0.97) = 5.64 + 1.4841 = 7.1241 \approx 7.12 \text{ minutes}$$

The item had an assembly time of 7.12 minutes.
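
The z-score formula and its rearranged form can be written as two small Python functions. This is only a sketch; the function names `z_score` and `value_from_z` are assumptions, and the same functions serve for population values (μ, σ) and sample values (x̄, s).

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Rearranged formula: the data value that has the given z-score."""
    return mean + z * sd

# A monthly expenditure of 38,300 with mean 48,700 and standard deviation 10,400.
print(z_score(38_300, 48_700, 10_400))              # -1.0

# Example 2: a value of 575 on a curve with mean 650 and standard deviation 40.
print(z_score(575, 650, 40))                        # -1.875

# Example 3: completion time of the item whose z-score is 1.53.
print(round(value_from_z(1.53, 5.64, 0.97), 2))     # 7.12
```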

C. Box-and-Whisker Plot

John Wilder Tukey (1915−2000) introduced the boxplot in the 1970s. A boxplot (or
box-and-whisker plot) is a graph of a data set obtained by drawing a horizontal line from
the minimum data value to the first quartile (Q1), drawing a horizontal line from the third quartile (Q3) to
the maximum data value, and drawing a box whose vertical sides pass through Q1 and Q3, with
a vertical line inside the box passing through the median (Q2).

The box-plot will give the following information:

1. If the median is near the center of the box, the distribution is approximately symmetric.
2. If the median falls to the right of the center of the box, the distribution is negatively skewed.
3. If the median falls to the left of the center of the box, the distribution is
positively skewed.
4. If the lines are about the same length, the distribution is approximately symmetric.
5. If the left line is longer than the right line, the distribution is negatively skewed.
6. If the right line is longer than the left line, the distribution is positively skewed.

[Figure: box-and-whisker plot on a scale from 0 to 60, marking the lowest value, Q1, Q2 (the median), Q3, and the highest value]

Example 1: Construct a box-plot for the data set of the ages of 9 middle-management
employees of a certain company. The ages are 53, 45, 59, 48, 54, 46, 51, 58, and 55.

Solution:
Step 1: Determine the 𝑄1 , Median, and 𝑄3 of the given set.
Recall that 𝑄1 = 47, 𝑀𝑒𝑑𝑖𝑎𝑛 = 53, 𝑄3 = 56.5

Step 2: Locate the lowest value, Q1, the median, Q3, and the highest value on the scale.

Step 3: Draw a box around Q1 and Q3, draw a vertical line through the median, and connect the
box to the lowest and highest values.

[Figure: box-and-whisker plot on a scale from 40 to 60, with lowest value 45, Q1 = 47, median = 53, Q3 = 56.5, and highest value 59]

The distribution of the data set is negatively skewed, since the median falls to the right of the
center of the box.
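
The conclusion of this example can be checked numerically by comparing the median with the midpoint of the box. The sketch below is a simplification: the function name `box_skew` is an assumption, and it treats only an exact match as symmetric, whereas the rule in this lesson says "near the center".

```python
def box_skew(q1, median, q3):
    """Classify skewness from the position of the median inside the box."""
    center = (q1 + q3) / 2
    if median > center:
        return "negatively skewed"      # median to the right of the center
    if median < center:
        return "positively skewed"      # median to the left of the center
    return "approximately symmetric"

# Five-number summary from this example.
lowest, q1, median, q3, highest = 45, 47, 53, 56.5, 59

print((q1 + q3) / 2)                # 51.75
print(box_skew(q1, median, q3))     # negatively skewed
```

Here the center of the box is 51.75 and the median is 53, so the median lies to the right of the center.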

CARRY OUT

Direction: Read, analyze, and solve the following problem.

1. The salary of junior executives in a large corporation in Ortigas Area is normally
distributed with a standard deviation of ₱15,600. A cutback in spending is planned, at which time those
who earn less than ₱85,000 will be discharged. If such a cut represents a z-score of
−1.28 for the junior executives, what is the mean salary of the group of junior
executives?

CHECKPOINT

Direction: Read, analyze, and solve the given problems.


1. The life expectancy of a particular car battery is 24 months with a standard deviation of 2
months. What is the 𝑧 − 𝑠𝑐𝑜𝑟𝑒 if a particular car battery lasted for only 20 months?

2. Fifteen randomly selected business administration students were asked to state the number of
hours they slept last Sunday. The resulting data are 4, 5, 7, 6, 7, 8, 10, 5, 4, 11, 12, 11, 10, 8, and 7. Find
the first quartile, second quartile, and third quartile.

3. Construct a Box-and-Whisker plot. Fifteen companies were asked, “To how many charitable
institutions did you give your cash donations during this pandemic (2020)?” The companies
responded that they gave money to 18, 15, 16, 10, 14, 13, 17, 20, 23, 27, 20, 19, 18, 27, and 22.

CONTEMPLATE ON IT

What have you learned from these lessons?


________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

Lesson 5 Probabilities and Normal Distribution
Learning Outcomes:
At the end of this lesson, you must have:
a) discussed the properties of normal distribution
b) constructed a normal curve given the data

CATCH IT

CONCEPTUALIZE

The normal distribution or Gaussian distribution is a continuous probability
distribution that describes data that cluster around a mean. The graph of the associated
probability density function is bell-shaped, with a peak at the mean, and is known as the
Gaussian function or bell curve.
The normal curve was developed mathematically in 1733 by Abraham de Moivre (1667-
1754) as an approximation to the binomial distribution. His paper was not discovered until 1924
by Karl Pearson (1857-1936). Pierre-Simon Laplace (1749-1827) used the normal curve in 1783
to describe the distribution of errors. Subsequently, Carl Friedrich Gauss (1777-1855) used the
normal curve to analyze astronomical data in 1809. The normal curve is often called the
Gaussian distribution. The term bell-shaped curve is often used in everyday language.
The normal distribution can be used to describe, at least approximately, any variable that
tends to cluster around the mean.
For example, the heights of adult males in the Philippines are roughly normally
distributed. Most men have a height close to the mean, though a small number of outliers have a
height significantly above or below the mean.
A histogram of male heights will appear similar to a bell curve, with the correspondence
becoming closer if more data are used.
For example, if a research investigator selects a random sample of 100 adult males,
measures their heights, and constructs a histogram, the researcher gets a graph similar to the
one presented. Now if the investigator increases the sample size and decreases the width of the
classes, the histogram will look like the ones presented. Lastly, if it were possible to measure the
heights of all adult males in the Philippines, the histogram would come close to what is called a
normal distribution.

[Figure: histograms of male heights: (a) random sample of 100 males; (b) sample size increased and class width decreased; (c) sample size increased and class width decreased further; (d) normal distribution for the population]

[Figure: normal curve with the horizontal axis marked from μ − 3σ to μ + 3σ (z from −3 to 3) and areas labeled 34.13%, 13.59%, and 2.28% on each side of the mean; about 68% of the area lies within 1σ of the mean, about 95% within 2σ, and about 99.7% within 3σ]

A normal distribution is a continuous, symmetric, bell-shaped distribution of a
variable. The known characteristics of a normal curve make it possible to estimate the probability
of occurrence of any value of a normally distributed variable. The properties of the normal
distribution are as follows:
1. The distribution is bell-shaped.
2. The mean, median, and mode are equal and are located at the center of the
distribution.
3. The normal distribution curve is unimodal.
4. The normal distribution curve is symmetric about the mean.
5. The normal distribution is continuous.
6. The normal curve is asymptotic (it never touches the x-axis).
7. The total area under the normal distribution curve is 1.00 or 100%.
8. The area under the part of a normal curve that lies within 1 standard deviation of the
mean is about 68%; within 2 standard deviations, about 95%; and within 3 standard
deviations, about 99.7%.
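
Property 8 can be verified numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later); the helper name `area_within` is an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k) * 100, 2))
# 1 68.27
# 2 95.45
# 3 99.73
```

The same percentages hold for any mean and standard deviation, because the areas depend only on the number of standard deviations from the mean.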

A. Standard Normal Distribution

A normal distribution can be converted into a standard normal
distribution by obtaining the z-value. A z-value is the signed distance between
a selected value, designated x, and the mean, μ, divided by the standard deviation. It
is also called the z-score, the z statistic, the standard normal deviate, or the
standard normal value. In terms of a formula:

Standard normal value: $$z = \frac{x - \mu}{\sigma}$$

Where: z = z-value
x = the value of any particular observation or measurement
μ = the mean of the distribution
σ = the standard deviation of the distribution
This property of the normal distribution allows us to convert a probability problem concerning x
into one concerning z. To determine the probability that x lies in a given interval,
convert the interval to the z scale and then compute the probability by using the
standard normal distribution table in ( ).
Example 1: Determine the area under the standard normal distribution curve
between z = 0 and z = 1.85.

Solution: Draw the figure and represent the area as shown in the figure below.
Since Table A gives the area between 0 and any z-value to the right of 0, we
only need to look up the z-value in the table. Find 1.8 in the left column and
0.05 in the top row. The value where the column and row meet in the table is
the answer, 0.4678.

Table A: Standardized Normal Distribution

z      0.00     0.01     :    0.05     …
0.0    0.0000   0.0040   :    0.0199   …
0.1    0.0398   0.0438   :    0.0596   …
0.2    0.0793   0.0832   :    0.0987   …
:      :        :        :    :        …
1.8    0.4641   0.4649   :    0.4678   …
:      :        :        :    :        …

[Figure: standard normal curve with the area between z = 0 and z = 1.85 shaded; P(0 < z < 1.85) = 0.4678]

Hence, the area is 0.4678 or 46.78%.

Example 2: Determine the area under the standard normal distribution curve
between z = 0 and z = −1.15.

Solution: The desired area is shown below.

[Figure: standard normal curve with the area between z = −1.15 and z = 0 shaded; the shaded area is 0.3749]

Because the curve is symmetric about z = 0, the area between z = 0 and z = −1.15 equals the
area between z = 0 and z = 1.15. From the table, P(−1.15 < z < 0) = 0.3749.

Therefore, the area is 0.3749 or 37.49%.

Example 3: Find the area under the standard normal distribution curve to the
right of z = 1.15.

The required area is at the right tail of the normal curve. Since the table gives the
area between z = 0 and z = 1.15, first find that area.

P(0 < z < 1.15) = 0.3749

Then subtract P(0 < z < 1.15) = 0.3749 from 0.5000, since half of the area under
the curve is to the right of z = 0.

P(z > 1.15) = 0.5000 − P(0 < z < 1.15)
            = 0.5000 − 0.3749
            = 0.1251

[Figure: standard normal curve showing the area 0.3749 between z = 0 and z = 1.15 and the shaded right-tail area 0.5000 − 0.3749 = 0.1251]

The area to the right of z = 1.15 is 0.1251 or 12.51%.
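
The table look-ups in Examples 1 to 3 can be reproduced without a printed table. The sketch below assumes Python 3.8 or later for `statistics.NormalDist`; the helper name `area_from_zero` is an assumption and mirrors what the table lists, namely the area between 0 and z.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def area_from_zero(z):
    """Area between 0 and z under the standard normal curve."""
    return abs(standard_normal.cdf(z) - 0.5)

example_1 = area_from_zero(1.85)          # between z = 0 and z = 1.85
example_2 = area_from_zero(-1.15)         # between z = -1.15 and z = 0
example_3 = 0.5 - area_from_zero(1.15)    # right tail beyond z = 1.15

print(round(example_1, 4))   # 0.4678
print(round(example_2, 4))   # 0.3749
print(round(example_3, 4))   # 0.1251
```

The absolute value in `area_from_zero` reflects the symmetry of the curve: a negative z gives the same area as the positive z of the same size.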

B. Application of Normal Distribution

In conjunction with the standard normal value formula, many different types of
probability problems involving the normal distribution can be solved. To illustrate this, we will work
through some examples.

$$z = \frac{x - \mu}{\sigma}$$

where: z = z-value
x = value of any particular observation or measurement
μ = population mean
σ = population standard deviation

The formula is used to gain information about an individual data value when the variable is
normally distributed.

Example 1: The average Pag-ibig Salary Loan for RFS Pharmacy Inc. employees is ₱23,000. If
the debt is normally distributed with a standard deviation of ₱2,500, find the probability that an
employee owes less than ₱18,500.

Solution:
Step 1: Draw a figure and represent the area.

[Figure: normal curve with mean 23,000; the area P(x < 18,500) to the left of 18,500 is shaded]

Step 2: Find the z-value for ₱18,500.

$$z = \frac{x - \mu}{\sigma} = \frac{18{,}500 - 23{,}000}{2{,}500} = \frac{-4{,}500}{2{,}500} = -1.80$$

Step 3: Find the appropriate area. The area obtained from the Standardized Normal Distribution
table is 0.4641, which corresponds to the area between z = 0 and z = −1.80.

P(−1.80 < z < 0) = 0.4641

Step 4: Subtract 0.4641 from 0.5000.

P(x < 18,500) = P(z < −1.80) = 0.5000 − P(−1.80 < z < 0) = 0.5000 − 0.4641 = 0.0359

[Figure: normal curve with the left-tail area 0.0359 shaded below 18,500; the mean is at 23,000]

Hence, the probability that an employee owes less than ₱18,500 in Pag-ibig salary loan is
0.0359 or 3.59%.

Example 2: The average age of bank managers is 40 years. Assume the variable is normally
distributed. If the standard deviation is 5 years, find the probability that the age of a randomly
selected bank manager will be in the range between 35 and 46 years old.

Solution: Assume that the ages of bank managers are normally distributed, with cut-off points as
shown in the figure below.

Step 1: Draw a figure and represent the area.

[Figure: normal curve with mean 40 and cut-off points at 35 and 46]

Step 2: Find the two z-values.

$$z = \frac{x - \mu}{\sigma} = \frac{35 - 40}{5} = -1.00 \qquad z = \frac{x - \mu}{\sigma} = \frac{46 - 40}{5} = 1.20$$

Step 3: Find the appropriate areas for z = −1.00 and z = 1.20 in Appendix A.
P(−1.00 < z < 0) = 0.3413     P(0 < z < 1.20) = 0.3849

Step 4: Add P(−1.00 < z < 0) and P(0 < z < 1.20).
P(35 < x < 46) = P(−1.00 < z < 1.20) = P(−1.00 < z < 0) + P(0 < z < 1.20)
= 0.3413 + 0.3849 = 0.7262

[Figure: normal curve with the area between x = 35 and x = 46 shaded; the x-values 35, 40, and 46 correspond to the z-values −1.00, 0, and 1.20]

Hence, the probability that a randomly selected bank manager is between 35 and 46
years old is 0.7262 or 72.62%.
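
Both application examples can also be checked directly from the original x-values, without converting to z by hand. The sketch below assumes Python 3.8 or later for `statistics.NormalDist`; the variable names are assumptions.

```python
from statistics import NormalDist

# Example 1: salary loans with mean 23,000 and standard deviation 2,500.
loans = NormalDist(mu=23_000, sigma=2_500)
p_loan_below_18500 = loans.cdf(18_500)            # P(x < 18,500)

# Example 2: ages of bank managers with mean 40 and standard deviation 5.
manager_ages = NormalDist(mu=40, sigma=5)
p_age_35_to_46 = manager_ages.cdf(46) - manager_ages.cdf(35)   # P(35 < x < 46)

print(round(p_loan_below_18500, 4))   # 0.0359
print(round(p_age_35_to_46, 4))       # 0.7263
```

The second result differs from the table-based 0.7262 in the fourth decimal place only because each table entry is rounded before the two areas are added.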

CARRY OUT

Just practice solving the following problem.

1. A company produces different types of energy drinks. The filling machines are
adjusted to pour 500 ml of energy drink into each plastic bottle. Nonetheless, the actual
amount of energy drink poured into each bottle is not exactly 500 ml; it varies from bottle to
bottle. It has been observed that the amount of energy drink in a bottle is normally distributed
with a mean of 500 ml and a standard deviation of 4.75 ml. What percentage of the energy drink
bottles contain 505 to 513 ml?

CHECKPOINT

Direction: Read, analyze, and solve the following problems involving normal
distribution.

1. In a population of high school students’ algebra scores, μ = 65 and σ = 6. Find the
z-value that corresponds to a score of x = 80.

2. A Physics score is 41. If it is given that μ = 42 and σ = 8, what is the corresponding
z-score?

3. In a certain university, the students were informed that they need a grade in the top
8% of the Engineering students to get a scholarship for the next semester.
In a standardization of the test, the mean was 77 and the standard
deviation was 14. Assuming that the grades are normally distributed, what would
be the minimum grade to obtain the scholarship grant?

CONTEMPLATE ON IT
What have you learned from these lessons?

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
_______________________________.

Lesson 6 Correlation and Linear Regression

Learning Outcomes:

At the end of this lesson, you must have:
a) differentiated correlation and linear regression
b) solved the correlation and linear regression of the given data
c) constructed a scatter diagram

CATCH IT

1. How do you know that a relationship exists between two variables?

2. When can we say that a positive or negative relationship exists?

CONCEPTUALIZE

Correlation is a statistical method used to determine whether a relationship between
variables exists. A variable here is a characteristic of the population being observed or measured.
For instance, the variables of interest might be advertising expense and sales. The sample then
consists of random observations of the variables describing a given population.

Regression analysis is a statistical method used to describe the nature of the
relationship between variables, that is, either positive or negative, linear or nonlinear. There are
two types of relationships: simple and multiple. In a simple relationship, there are two
variables: an independent variable (or explanatory variable or predictor variable) and a
dependent variable (or response variable). A simple linear relationship can be positive or negative.
A positive relationship exists when both variables increase at the same time or both
decrease at the same time. On the contrary, in a negative relationship, as one variable
increases, the other variable decreases, or vice versa. This text limits its discussion to
simple linear regression analysis.

A scatter diagram is a useful tool for checking the assumptions in a regression analysis. It
can be viewed during an initial screening run of the analysis or after the analysis. The benefit of
looking at scatter diagrams of the residuals in the beginning stages of an analysis is that it may save the
researcher time. If the assumptions are not met, further screening must be applied before the
analysis can be completed, and the data may require cleansing and transformation. In this way, the
researcher does not run the analysis haphazardly. If the assumptions are met, the regression is
ready to be run and the researcher has increased confidence that the chances of making a Type
I or Type II error are reduced, ultimately improving the accuracy of any research results.

A. Pearson Product-Moment Correlation

Pearson product-moment correlation is the most widely used statistic for
measuring the degree of the relationship between linearly related variables. The Pearson r
correlation requires both variables to be normally distributed. Correlation refers to the
departure of two random variables from independence.

For example, in the stock market, if we want to measure how two products are related to each
other, the Pearson r correlation is used to measure the degree of relationship between the two
products.

The correlation coefficient is defined as the covariance divided by the product of the standard
deviations of the two variables. The following formulas are used to calculate the Pearson r correlation.

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} \quad \text{or} \quad r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{\left[n(\sum x^2) - (\sum x)^2\right]\left[n(\sum y^2) - (\sum y)^2\right]}}$$

Pearson’s product-moment correlation coefficient, or simply the correlation coefficient (or
Pearson’s 𝑟), is a measure of the strength of the linear association between two variables. It was
developed by Karl Pearson. The value of the correlation coefficient varies between +1 and −1.
When the value of the correlation coefficient is exactly ±1, there is a perfect
degree of association between the two variables. As the value of the correlation coefficient goes
closer to zero, the relationship between the two variables becomes weaker. This information is
summarized in the chart below.

[Figure: six scatter plots of Y Variable against X Variable: Perfect Positive Correlation (𝑟 = 1.00), Perfect Negative Correlation (𝑟 = −1.00), Positive Correlation (𝑟 = 0.80), Negative Correlation (𝑟 = −0.80), Zero Correlation (𝑟 = 0.00), and Non-Linear Correlation]

The following summarizes the correlation coefficient and strength of relationships:

Correlation coefficient    Strength of relationship
0.00                       No correlation, no relationship
±0.01 to ±0.20             Very low correlation, almost negligible relationship
±0.21 to ±0.40             Slight correlation, definite but small relationship
±0.41 to ±0.70             Moderate correlation, substantial relationship
±0.71 to ±0.90             High correlation, marked relationship
±0.91 to ±0.99             Very high correlation, very dependable relationship
±1.00                      Perfect correlation, perfect relationship
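
The scale above can be sketched as a small lookup function. The function name is hypothetical, and rounding 𝑟 to two decimal places before classifying it is an assumption made here so that values falling between the printed bands (for example 0.205) still receive a label.

```python
def describe_correlation(r):
    """Return the verbal label from the scale for a coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = round(abs(r), 2)  # assumption: the bands are read to two decimals
    if size == 0.00:
        return "No correlation, no relationship"
    if size <= 0.20:
        return "Very low correlation, almost negligible relationship"
    if size <= 0.40:
        return "Slight correlation, definite but small relationship"
    if size <= 0.70:
        return "Moderate correlation, substantial relationship"
    if size <= 0.90:
        return "High correlation, marked relationship"
    if size <= 0.99:
        return "Very high correlation, very dependable relationship"
    return "Perfect correlation, perfect relationship"

print(describe_correlation(0.93))
print(describe_correlation(-0.35))
```

The sign of 𝑟 gives the direction of the relationship; only its absolute value is used to judge strength.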

A test of significance for the coefficient of correlation may be used to find out whether the
computed Pearson’s 𝑟 could have occurred in a population in which the two variables are
not related. The test statistic follows the 𝑡 distribution with 𝑛 − 2 degrees of
freedom. It is computed using the formula:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

Where: 𝑡 = t-test for the correlation coefficient
𝑟 = correlation coefficient
𝑛 = number of paired samples
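
A minimal sketch of this formula in Python follows. The function name and the inputs (𝑟 = 0.50 from 𝑛 = 27 pairs) are hypothetical values used only to show the arithmetic.

```python
import math

def t_for_correlation(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than two pairs")
    if abs(r) >= 1:
        raise ValueError("t is undefined when r is exactly +1 or -1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical inputs: r = 0.50, n = 27, so df = 25
t = t_for_correlation(0.50, 27)
print(round(t, 4))  # 2.8868
```

Because 1 − 𝑟² appears in the denominator, the statistic grows quickly as 𝑟 approaches ±1, and it also grows with the sample size.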

Assumptions in the Pearson Product-Moment Correlation test:

1. Subjects are randomly selected.
2. Both populations are normally distributed.

Procedure for Pearson Product-Moment Correlation test:

1. Set up the hypotheses:

𝐻0 : 𝜌 = 0 (The correlation in the population is zero)
𝐻1 : 𝜌 ≠ 0 (The correlation in the population is different from zero); for a one-tailed test, 𝜌 > 0 or 𝜌 < 0
Where: 𝜌 = correlation in the population
2. Set the level of significance.
3. Calculate the degrees of freedom (𝑑𝑓 = 𝑛 − 2) and determine the critical value of 𝑡.
4. Calculate the value of Pearson’s 𝑟.
5. Calculate the value of 𝑡 and determine the statistical decision for hypothesis testing.
If |𝑡 computed| < 𝑡 critical, do not reject 𝐻0.
If |𝑡 computed| ≥ 𝑡 critical, reject 𝐻0.
6. State the conclusion.
The test for the correlation coefficient is two-tailed; the rejection region is divided into
two equal parts (i.e., at the 0.05 level, 0.05 is divided into two equal parts of 0.025 each).
The figure below illustrates the rejection and non-rejection regions of the test of hypothesis
of the correlation coefficient.

[Figure: Testing the Hypothesis of Correlation Coefficient at 0.05 Significance Level. The two tails are the rejection regions (there is correlation); the center is the non-rejection region (no correlation in the population).]

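
The decision step of the procedure can be sketched as follows. The function name, the `tail` argument, and the sample 𝑡 values are hypothetical; the critical value 2.060 is the usual table value for a two-tailed test at the 0.05 level with 25 degrees of freedom. For a two-tailed test the comparison uses the absolute value of the computed 𝑡, so that a large negative 𝑡 also falls in a rejection region.

```python
def decide(t_computed, t_critical, tail="two"):
    """Return the decision on H0 for a given positive critical value.

    tail: "two" for H1: rho != 0, "right" for H1: rho > 0,
          "left" for H1: rho < 0.
    """
    if tail == "two":
        reject = abs(t_computed) >= t_critical
    elif tail == "right":
        reject = t_computed >= t_critical
    elif tail == "left":
        reject = t_computed <= -t_critical
    else:
        raise ValueError("tail must be 'two', 'right', or 'left'")
    return "reject H0" if reject else "do not reject H0"

# Hypothetical computed values compared with t critical = 2.060 (df = 25)
print(decide(2.89, 2.060))   # reject H0
print(decide(-1.50, 2.060))  # do not reject H0
```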
When the null hypothesis has been rejected at a specific significance level, there are several
possible relationships between the 𝑥 and 𝑦 variables:

1. There is a direct cause-and-effect relationship between the two variables.
2. There is a reverse cause-and-effect relationship between the two variables.
3. The relationship between the two variables may be caused by a third variable.
4. There may be a complex interrelationship among many variables.
5. The relationship between the two variables may be coincidental.

Example 1: The owner of a chain of fruit shake stores would like to study the correlation
between atmospheric temperature and sales during the summer season. A random sample of 12
days is selected with the results given as follows:

Day                     1    2    3    4    5    6    7    8    9   10   11   12
Temperature (℉)        79   76   78   84   90   83   93   94   97   85   88   82
Total Sales (units)   147  143  147  168  206  155  192  211  209  187  200  150

Plot the data on a scatter diagram. Does there appear to be a relationship between
atmospheric temperature and sales? Compute the coefficient of correlation. Determine at the
0.05 significance level whether the correlation in the population is different from zero.

Solution:

Step 1: Graph the scatter plot.

[Figure: scatter plot of Sales (Y) against Temperature (X)]

Step 2: State the hypotheses.

𝐻0 : 𝜌 = 0 (There is no correlation between atmospheric temperature and total sales of fruit
shake.)
𝐻1 : 𝜌 ≠ 0 (There is a correlation between atmospheric temperature and total sales of fruit
shake.)

Step 3: The level of significance is 𝛼 = 0.05

Step 4: Determine the degrees of freedom and the critical values of 𝑡. Refer to Appendix B.

𝑑𝑓 = 𝑛 − 2 = 12 − 2 = 10 and 𝑡 = ±2.228

Step 5: Compute the value of 𝑟 (Pearson product-moment correlation coefficient).

Day        𝑥        𝑦        𝑥²         𝑦²         𝑥𝑦
1         79      147     6,241     21,609     11,613
2         76      143     5,776     20,449     10,868
3         78      147     6,084     21,609     11,466
4         84      168     7,056     28,224     14,112
5         90      206     8,100     42,436     18,540
6         83      155     6,889     24,025     12,865
7         93      192     8,649     36,864     17,856
8         94      211     8,836     44,521     19,834
9         97      209     9,409     43,681     20,273
10        85      187     7,225     34,969     15,895
11        88      200     7,744     40,000     17,600
12        82      150     6,724     22,500     12,300
Total  1,029    2,115    88,733    380,887    183,222

∑𝑥 = 1,029    ∑𝑦 = 2,115    ∑𝑥² = 88,733    ∑𝑦² = 380,887    ∑𝑥𝑦 = 183,222

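$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

$$r = \frac{12(183{,}222) - (1{,}029)(2{,}115)}{\sqrt{\left[12(88{,}733) - (1{,}029)^2\right]\left[12(380{,}887) - (2{,}115)^2\right]}} = \frac{22{,}329}{\sqrt{[5{,}955][97{,}419]}} = 0.927057 \approx 0.93$$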
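
The column totals and the value of 𝑟 can be double-checked with a short script. This is only a verification sketch using the twelve data pairs of Example 1; the variable names are chosen here for illustration.

```python
import math

temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]        # x
sales = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]  # y

n = len(temperature)
sum_x = sum(temperature)
sum_y = sum(sales)
sum_x2 = sum(x * x for x in temperature)
sum_y2 = sum(y * y for y in sales)
sum_xy = sum(x * y for x, y in zip(temperature, sales))

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 1029 2115 88733 380887 183222
print(numerator)                             # 22329
print(round(r, 2))                           # 0.93
```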

The coefficient of correlation, 𝑟 = 0.93, between the atmospheric temperature and
total sales indicates a very high positive correlation (very dependable relationship); that
is, an increase in atmospheric temperature is highly associated with an increase in
total sales of fruit shake.

Step 6: Decision Rule:

In order to make a decision on the significance of the relationship, we need to determine the
value of 𝑡.

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.93\sqrt{12-2}}{\sqrt{1-(0.93)^2}} = \frac{0.93(3.16227766)}{\sqrt{1-0.8649}} = \frac{2.940918224}{0.367559519} \approx 8.00$$

Since the computed 𝑡-value of 8.00 is greater than the tabular value of 2.228 at the
0.05 level of significance, we reject the null hypothesis.
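
The worked value of 𝑡 uses 𝑟 rounded to 0.93. As a hedged check, the sketch below repeats the calculation with the unrounded 𝑟 from Step 5; the helper name is hypothetical. The unrounded statistic comes out somewhat smaller (about 7.82), but it is still well beyond the critical value of 2.228, so the decision is unchanged.

```python
import math

n = 12
t_critical = 2.228  # two-tailed, alpha = 0.05, df = 10 (value used in the text)

def t_for_correlation(r, n):
    """t statistic for testing rho = 0 with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r_rounded = 0.93
r_unrounded = 22329 / math.sqrt(5955 * 97419)  # quantities from Step 5

t_rounded = t_for_correlation(r_rounded, n)
t_unrounded = t_for_correlation(r_unrounded, n)

print(round(t_rounded, 2))             # 8.0
print(round(t_unrounded, 2))           # 7.82
print(abs(t_unrounded) >= t_critical)  # True, so H0 is rejected either way
```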

Step 7: Conclusion:
Since the null hypothesis has been rejected, we can conclude that there is evidence
of a significant association between the atmospheric temperature and the total
sales of fruit shake.

B. Simple Linear Regression Analysis

Regression analysis is a simple statistical tool used to model the dependence of a
variable on one (or more) explanatory variables. This functional relationship may then be
formally stated as an equation, with associated statistical values that describe how well this
equation fits the data.

A simple linear regression is the least squares estimator of a linear regression model with a
single predictor (one independent variable). The least squares model determines a
regression equation by minimizing the sum of squares of the vertical distances between the
actual 𝑦 values and the predicted values of 𝑦. In other words, simple regression fits a straight line
through the set of 𝑛 points in a way that makes the sum of squared residuals of the model
as small as possible. This method gives what is generally known as the "best-fitting" line. The
difference between an observed and a predicted value is called the residual. The mean of the
residuals is always zero. Points that fall outside the overall pattern of the points are known
as outliers.
In a scatter plot, scores whose removal greatly changes the regression line
are called influential scores. In some cases, these scores are limited to points with
extreme 𝑥-values. Some influential scores may have a small residual but still have a greater
effect on the regression line than scores with possibly larger residuals but average 𝑥-values.
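
The effect of an influential score can be illustrated with a small sketch. The six data points are hypothetical, with the last one given an extreme 𝑥-value on purpose, and the helper uses the standard least squares slope and intercept written in deviation form. Removing the last point changes the slope from about 0.137 to 0.6, even though that point's residual is smaller in size than the residual of the point at 𝑥 = 3.

```python
def fit_line(x, y):
    """Least squares slope and intercept (deviation form)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data: the last point has an extreme x-value
x = [1, 2, 3, 4, 5, 20]
y = [2, 4, 5, 4, 5, 6]

slope_all, intercept_all = fit_line(x, y)
slope_without, intercept_without = fit_line(x[:-1], y[:-1])

# Residuals measured from the line fitted to all six points
residuals = [yi - (slope_all * xi + intercept_all) for xi, yi in zip(x, y)]

print(round(slope_all, 3), round(slope_without, 3))
print(round(residuals[-1], 3), round(residuals[2], 3))
```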

$$\hat{y} = b_1 x + b_0 \qquad b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \qquad b_0 = \bar{y} - b_1\bar{x}$$

Where: 𝑦̂ = predicted or fitted value of 𝑦
𝑥 = the value of any particular observation of the independent variable
𝑦 = the value of any particular observation of the dependent variable
𝑏1 = slope of the regression line
𝑏0 = intercept of the regression line
𝑥̅ = mean of the independent variable
𝑦̅ = mean of the dependent variable
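
A minimal sketch of these formulas in Python is shown below. The function name and the five data pairs are hypothetical. The last line also checks the property that the residuals of a least squares line sum to zero, allowing for floating-point rounding.

```python
def fit_line(x, y):
    """Least squares slope b1 and intercept b0 from the raw-sum formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n  # b0 = y_bar - b1 * x_bar
    return b1, b0

# Hypothetical data, not taken from the module
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b1, b0 = fit_line(x, y)
residuals = [yi - (b1 * xi + b0) for xi, yi in zip(x, y)]

print(round(b1, 4), round(b0, 4))  # 0.6 2.2
print(abs(sum(residuals)) < 1e-9)  # True
```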

Example 2: Referring to Example 1 on atmospheric temperature and sales, determine
the regression equation, plot the regression line, and interpret it.

Solution:
Computation of the Simple Linear Regression Equation
Step 1: Obtain the sums of 𝑥, 𝑦, 𝑥², 𝑦², and 𝑥𝑦.
∑𝑥 = 1,029    ∑𝑥² = 88,733    ∑𝑥𝑦 = 183,222
∑𝑦 = 2,115    ∑𝑦² = 380,887

Step 2: Compute the slope of the simple linear regression.

$$b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

$$b_1 = \frac{12(183{,}222) - (1{,}029)(2{,}115)}{12(88{,}733) - (1{,}029)^2} = \frac{2{,}198{,}664 - 2{,}176{,}335}{1{,}064{,}796 - 1{,}058{,}841} = \frac{22{,}329}{5{,}955} = 3.7496$$

Step 3: Compute the mean values of 𝑥 and 𝑦.

$$\bar{x} = \frac{\sum x}{n} = \frac{1{,}029}{12} = 85.75 \qquad \bar{y} = \frac{\sum y}{n} = \frac{2{,}115}{12} = 176.25$$

Step 4: Compute the intercept of the simple linear regression.

𝑏0 = 𝑦̅ − 𝑏1 𝑥̅ = 176.25 − 3.7496(85.75) = 176.25 − 321.5282 = −145.2782

Step 5: Substitute the slope and intercept into the general simple linear regression equation.
𝑦̂ = 𝑏1 𝑥 + 𝑏0 (general equation for simple linear regression)
The simple linear regression equation is 𝑦̂ = 3.7496𝑥 − 145.2782
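
The slope and intercept can be re-computed from the data of Example 1 as a check. This is a verification sketch with variable names chosen here; it also shows that the intercept of −145.2782 results from using the slope rounded to four decimal places, while carrying the unrounded slope gives about −145.2801. The difference is negligible for this problem.

```python
temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]        # x
sales = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]  # y

n = len(temperature)
sum_x, sum_y = sum(temperature), sum(sales)
sum_x2 = sum(x * x for x in temperature)
sum_xy = sum(x * y for x, y in zip(temperature, sales))

x_bar, y_bar = sum_x / n, sum_y / n
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

b0_rounded_slope = y_bar - round(b1, 4) * x_bar  # as in the worked solution
b0_full_precision = y_bar - b1 * x_bar           # slope not rounded first

print(x_bar, y_bar)                 # 85.75 176.25
print(round(b1, 4))                 # 3.7496
print(round(b0_rounded_slope, 4))   # -145.2782
print(round(b0_full_precision, 4))  # -145.2801
```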

Step 6: Graph the least squares regression line.

[Figure: scatter plot of Sales (Y) against Temperature (X) with the fitted regression line 𝑦̂ = 3.7496𝑥 − 145.2782]

Thus, the regression equation is 𝑦̂ = 3.7496𝑥 − 145.2782. The 𝑏1 of 3.7496 indicates that for
each additional degree Fahrenheit, sales are expected to increase by 3.7496 units. The
𝑏0 value of −145.2782 indicates that the intercept with the 𝑦-axis is below the origin. Read
literally, if the temperature were zero degrees Fahrenheit, a negative 145.2782 units
would be sold. This has no practical meaning, because 0 ℉ lies far outside the observed
temperatures of 76 ℉ to 97 ℉.
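
The regression equation can be used to predict sales for a given temperature, as in the sketch below. The function names are hypothetical, and the range check simply uses the lowest and highest temperatures in the sample (76 ℉ and 97 ℉) as a reminder that predictions outside the observed data should not be trusted.

```python
B1 = 3.7496             # slope from the worked solution
B0 = -145.2782          # intercept from the worked solution
X_MIN, X_MAX = 76, 97   # observed temperature range in Example 1

def predict_sales(temp_f):
    """Predicted sales (units) for a temperature in degrees Fahrenheit."""
    return B1 * temp_f + B0

def within_observed_range(temp_f):
    """True when the temperature lies inside the sampled range."""
    return X_MIN <= temp_f <= X_MAX

for temp in (90, 0):
    print(temp, round(predict_sales(temp), 2), within_observed_range(temp))
# 90 192.19 True
# 0 -145.28 False
```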

CARRY OUT

Use the problem below to practice solving problems about correlation and regression.

1. A random sample of nine (9) cities gave the following figures for annual per capita cigarette
consumption and annual death rate from lung cancer.

City                         1    2    3    4    5    6    7    8    9
Cigarette Consumption (x)  350  370  250  260  255  300  400  330  240
Death Rate (y)              21   24   17   18   17   19   25   20   16

a. Calculate the sample correlation 𝑟. At the 0.01 level of significance, test whether cigarette
consumption and lung cancer are unrelated.
b. Determine the regression line.

CHECKPOINT

Direction: Read, analyze, and solve the following problems.

1. The following table shows the scores on a clerical aptitude test and grades in a clerical skills
course for 11 business administration students.

Aptitude Test Score (x)   71   77   67   76   60   72   95   85   91   83   71
Skills Course Score (y)   73   85   74   82   70   83   92   85   97   89   79

a. Sketch the data using a scatter diagram.
b. Determine whether the data provide sufficient evidence to indicate that clerical aptitude test
score and clerical skills course score are linearly related, using the 0.01 level of significance.
c. Find the regression line.

2. The city engineer wants to establish the relationship between household size and monthly
household water consumption. Given the data in the table, do the following:

Household size (x)             3     3     4     9     8     7     6     5     10
Gallons of Water Used (y)    655   700   500   800   600   900   570   450  1,200

a. Plot the data in a scatter plot and find the correlation 𝑟.
b. Determine whether we can conclude from these data that the two variables are linearly
related at the 0.05 level of significance.
c. Find the regression line.

CONTEMPLATE ON IT
What have you learned from these lessons?
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

