Module 4 Data Management
Module Overview
In today's world we process enormous amounts of data almost every day. Schools,
laboratories, and companies all handle large volumes of data, and data management plays an
important role in processing them. To analyze a phenomenon, we need to manage data with
the help of statistics. When data are managed efficiently, we gain a better understanding of
the nature of that phenomenon, which in turn can improve life in the modern world.
Obtaining and presenting data is challenging and demands careful handling. A person who
collects data must be precise, accurate, open-minded, and creative in presenting the data
obtained.
Objectives:
CATCH IT
How do we make a frequency distribution table?
What is the purpose of a frequency distribution table?
How do we find the midpoint in a frequency distribution table?
CONCEPTUALIZE
A. Organization of Data
When conducting statistical research, an investigation, or a study, the researcher must
gather data for the particular variable under investigation. To describe situations, make
conclusions, and draw inferences about events, the researcher must organize the data gathered
in some meaningful way. The easiest and most widely used way of organizing data is to
construct a frequency distribution. A frequency distribution is a grouping of the data into
categories showing the number of observations in each of the non-overlapping classes.
After organizing the data, the researcher's next move is to present them so that they can
be understood easily by those who will benefit from reading the study. The most useful method
of presenting data is by constructing graphs and charts. There are a number of ways to plot
graphs and charts, and each one has a specific purpose.
This section discusses how to organize data by constructing a frequency distribution and
how to present data by constructing graphs and charts. Before we start constructing a
frequency distribution, we must define some terms that are essential for a deeper
understanding of the nature of the data displayed in a frequency distribution.
High       IIII-II     7
Average    IIII-III    8
Low        IIII        5
Total                 20    100
Generally, the number of classes for a frequency distribution table varies from 5
to 20, depending primarily on the number of observations in the data set. It is preferred to have
more classes as the size of the data set increases. The decision about the number of classes
depends on the method used by the researcher.
1. Rule 1. To determine the number of classes, use the smallest positive integer $k$
such that $2^k \ge n$, where $n$ is the total number of observations. The suggested class
interval is then
\[ i = \frac{\text{Range}}{\text{Number of Classes}} = \frac{HV - LV}{k} \]
Where: $HV$ = Highest value in a data set
$LV$ = Lowest value in a data set
$k$ = number of classes
$i$ = suggested class interval
2. Rule 2. Another way to determine the class interval is by applying the formula:
\[ i = \frac{\text{Range}}{1 + 3.322 \log n} \]
where $\log n$ is the common (base 10) logarithm of the total of the frequencies.
17,400 32,400 20,200 21,300 26,200 22,750 24,600 27,300 23,500 29,500
14,000 30,500 17,950 20,250 24,750 21,750 23,700 26,500 22,900 27,500
15,500 30,700 18,400 20,400 25,000 21,900 23,850 26,800 23,000 27,800
17,300 32,100 20,000 21,000 26,100 22,600 24,500 27,000 23,400 29,300
15,700 30,700 18,700 20,500 25,150 21,900 24,100 26,900 23,200 27,900
14,300 30,650 18,350 20,300 25,000 21,800 23,700 26,500 22,900 27,600
17,000 30,750 18,800 20,800 26,000 22,000 24,300 27,000 23,400 27,900
17,800 33,500 20,250 21,600 26,300 22,800 24,700 27,400 23,700 30,400
Solution:
Step 1: Arrange the raw data in ascending or descending order. In this particular example
we will arrange raw data in ascending order. This will make it easier for us to tally the data.
14,000 17,950 20,250 21,750 22,900 23,700 24,750 26,500 27,500 30,500
14,300 18,350 20,300 21,800 22,900 23,700 25,000 26,500 27,600 30,650
15,500 18,400 20,400 21,900 23,000 23,850 25,000 26,800 27,800 30,700
15,700 18,700 20,500 21,900 23,200 24,100 25,150 26,900 27,900 30,700
17,000 18,800 20,800 22,000 23,400 24,300 26,000 27,000 27,900 30,750
17,300 20,000 21,000 22,600 23,400 24,500 26,100 27,000 29,300 32,100
17,400 20,200 21,300 22,750 23,500 24,600 26,200 27,300 29,500 32,400
17,800 20,250 21,600 22,800 23,700 24,700 26,300 27,400 30,400 33,500
The objective is to use just enough classes. We can determine the number of classes ($k$)
using the "2 to the $k$ rule". This rule selects the smallest number $k$ for the number
of classes such that $2^k$ (2 raised to the power of $k$) is greater than or equal to the
number of observations ($n$).
Using our example, there are 80 young professionals ($n = 80$). If we try $k = 6$,
which means we would use 6 classes, then $2^k = 2^6 = 64$, which is less than 80. Thus, 6
classes are not enough. If we try $k = 7$, then $2^k = 2^7 = 128$, which is greater than 80.
Therefore, the recommended number of classes is 7.
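The "2 to the k rule" can be checked with a few lines of Python. This is only a minimal sketch; the function name `number_of_classes` is an illustrative choice, not part of the module.

```python
def number_of_classes(n: int) -> int:
    """Return the smallest positive integer k such that 2**k >= n."""
    if n < 1:
        raise ValueError("n must be a positive number of observations")
    k = 1
    while 2 ** k < n:  # keep increasing k until 2^k covers n
        k += 1
    return k


# The example in the text has n = 80 observations.
k = number_of_classes(80)
print(k)  # 2**6 = 64 is too small, 2**7 = 128 is enough
```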
Generally, the class interval (or width) should be equal for all classes. The classes must
cover all the values in the raw data (that is, from lowest to highest). The class interval is
generated using the formula:
\[ \text{Suggested Class Interval} = \frac{\text{Range}}{\text{Number of Classes}} = \frac{HV - LV}{k} = \frac{19{,}500}{7} = 2{,}785.7 \approx 2{,}800 \]
Note: Round the value of the interval up if there is a remainder. Here the interval is
rounded up to the convenient value 2,800.
We add the interval (or width) to the lowest value, taken as the starting point, to
obtain the lower limit of the next class. We keep adding until we have 7 classes, which gives
the lower limits 14,000; 16,800; 19,600; 22,400; 25,200; 28,000; and 30,800.
To obtain the upper class limits, we add the interval to the lower limit of each class.
For the first class, 14,000 + 2,800 = 16,800. Repeating this for every lower limit gives all
the upper limits.
Class Limits
14,000 < 16,800
16,800 < 19,600
19,600 < 22,400
22,400 < 25,200
25,200 < 28,000
28,000 < 30,800
30,800 < 33,600
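The class limits above can be reproduced with a short Python sketch. The helper names `suggested_interval` and `class_limits`, and the `round_to` parameter (rounding up to the next multiple of 100), are assumptions made for illustration; the module itself only says to round the interval up.

```python
import math


def suggested_interval(highest, lowest, k, round_to=1):
    """Range divided by the number of classes, rounded up to a multiple of round_to."""
    raw = (highest - lowest) / k
    return math.ceil(raw / round_to) * round_to


def class_limits(lowest, width, k):
    """List of (lower limit, upper limit) pairs for k equal-width classes."""
    return [(lowest + i * width, lowest + (i + 1) * width) for i in range(k)]


# Values from the worked example: HV = 33,500, LV = 14,000, k = 7.
width = suggested_interval(33_500, 14_000, 7, round_to=100)
limits = class_limits(14_000, width, 7)
for lower, upper in limits:
    print(f"{lower:,} < {upper:,}")
```

Each class is read as "lower limit up to, but not including, the upper limit", which keeps the classes non-overlapping.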
Step 5: Determine the relative frequency. It can be found by dividing each frequency by
the total frequency.
Step 7: Determine the cumulative frequencies. The cumulative frequency can be found
by adding the frequency in each class to the total frequencies of the classes
preceding that class.
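Relative and cumulative frequencies follow directly from the frequency column. The sketch below reuses the small High/Average/Low tally shown earlier (frequencies 7, 8, and 5); the variable names are illustrative only.

```python
from itertools import accumulate

# Frequencies from the High / Average / Low tally table (total of 20).
frequencies = {"High": 7, "Average": 8, "Low": 5}

total = sum(frequencies.values())

# Relative frequency: each frequency divided by the total frequency.
relative = {label: f / total for label, f in frequencies.items()}

# Cumulative frequency: running total of the frequencies, class by class.
cumulative = dict(zip(frequencies, accumulate(frequencies.values())))

for label in frequencies:
    print(label, frequencies[label], f"{relative[label]:.2f}", cumulative[label])
```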
Frequency Polygon: A frequency polygon is a graph that displays the data using
points which are connected by lines. The frequencies are represented by the heights of
the points at the midpoints of the classes. The vertical axis represents the frequency of
the distribution while the horizontal axis represents the midpoints of the frequency
distribution.
Solution:
a. Constructing a Histogram
Step 3: Represent the frequency on the 𝑦 − 𝑎𝑥𝑖𝑠 and the midpoints on the 𝑥 − 𝑎𝑥𝑖𝑠.
Step 4: Use the frequency to represent the height and draw the vertical bars.
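[Figure: histogram with Frequency on the vertical axis and Salary (in Thousands) on the horizontal axis]
[Figure: second chart with Frequency on the vertical axis and Salary (in Thousands) on the horizontal axis]
[Figure: third chart with a vertical scale from 0 to 60 and horizontal values from 16.5 to 34.5]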
CARRY OUT
CHECKPOINT
I. Direction. Use the given data below to answer what is being asked.
1. A marketing research consultant conducted a survey of 40 persons who visited
fast food chains one morning. The ages of the persons were recorded
to the nearest year as follows.
16 29 44 36 40 24 28 47 34 46 35 26
50 33 38 19 22 53 44 55 32 21 44 41
19 40 30 47 47 27 50 33 46 48 29 27
32 31 42 28
Prepare a frequency distribution table using Rule 1 and Rule 2.
2. The daily numbers of machine copies made by Estaya copying center are
grouped into a table having the classes 0-49, 50-99, 100-149, and 150-199.
Find
a) the class boundaries
b) the class midpoints
c) the class interval
CONTEMPLATE ON IT
List down the concepts that you have learned from this lesson.
Related Readings:
https://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/frequency-distribution-
https://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/
References:
Soriano, JM. (2019). Mathematics in the Modern World. Books Atbp. Publishing
Corp. 707 Tiaga cor. Kasipagan St., Barangka Drive, Mandaluyong City,
Philippines.
Learning Outcomes:
At the end of this lesson, you must have:
1. Discussed the properties of the three different measures of central tendency: the
mean, median and the mode;
2. Computed the measures of central tendency;
3. Manifested appreciation of the applications of the measures of central tendency in
daily life situations.
CATCH IT
The following are the scores of ten students in a 20-item math quiz:
CONCEPTUALIZE
a. How did you compute the average score? What greatly affects the mean?
b. What do we call the middle most score? What affects this value?
c. How many scores appeared frequently? What are they? What do we call
them? Would it be possible that you can easily determine the mode?
I. MEAN
The arithmetic mean, often simply called the mean, is the most frequently used
measure of central tendency. The mean is the only common measure in which all values
play an equal role, meaning that to determine its value you need to consider every
value in the data set. The mean is appropriate for determining the central tendency
of interval or ratio data.
The symbol $\bar{x}$ (read "x bar") is used to represent the mean of a sample, and the
symbol $\mu$ is used to denote the mean of a population.
Example 1: The daily rates of a sample of eight employees at GMS Inc. are
₱550, ₱420, ₱560, ₱500, ₱700, ₱670, ₱860, ₱480. Find the mean daily rate of
employees.
Solution:
\[ \bar{x} = \frac{\sum x}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8}{n} \]
\[ \bar{x} = \frac{\sum x}{n} = \frac{550 + 420 + 560 + 500 + 700 + 670 + 860 + 480}{8} = \frac{4{,}740}{8} = 592.50 \]
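The same computation can be verified in Python. This is a minimal sketch using the eight daily rates from Example 1; `statistics.mean` is part of the Python standard library.

```python
from statistics import mean

# Daily rates (in pesos) of the eight sampled employees.
daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]

# Mean as the sum of the values divided by the number of values.
sample_mean = sum(daily_rates) / len(daily_rates)

print(sum(daily_rates))   # total of the eight rates
print(sample_mean)        # sum divided by n = 8
print(mean(daily_rates))  # same result from the standard library
```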
𝑓 = frequency
𝑥 = midpoint
$\sum fx$ = sum of all the products of $f$ and $x$.
Solution:
Step 1. Determine the midpoint of each class.
Step 2. Multiply each class frequency ($f$) by the corresponding midpoint
($x$) to obtain the product $fx$.
Step 3. Get the sum of the products $fx$.
Step 4. Apply the formula to obtain the value of the sample mean.
\[ \bar{x} = \frac{\sum fx}{n} = \frac{2{,}459}{50} = 49.18 \]
Thus, the mean age of the travelers in the frequency distribution is 49.18 years.
𝑤𝑖 = corresponding weight
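Solution:
\[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} \]
\[ \bar{x}_w = \frac{(18)(30{,}500) + (12)(33{,}700) + (7)(38{,}600) + (3)(45{,}000)}{18 + 12 + 7 + 3} = \frac{1{,}358{,}600}{40} = 33{,}965 \]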
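The weighted mean above can be reproduced with a short function. This is a sketch only; the name `weighted_mean` is illustrative, and the values and weights are the ones substituted in the solution.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Values and their corresponding weights from the worked solution.
values = [30_500, 33_700, 38_600, 45_000]
weights = [18, 12, 7, 3]

result = weighted_mean(values, weights)
print(result)
```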
B. Geometric Mean
The geometric mean has two main uses: the first is to average percents, indexes, and
relatives; the second is to establish the average percent increase in production, sales,
or other business transactions or economic series from one period of time to another.
Formulas:
(1) \( GM = \sqrt[n]{(x_1)(x_2)(x_3) \cdots (x_n)} \)
Example 2:
Example 3:
Badminton as a sport grew rapidly in 2015. From January to December 2015 the
number of badminton clubs in Metro Manila increased from 20 to 155. Compute the
mean monthly percent increase in the number of badminton clubs.
Solution:
Note that 12 months are involved. However, there are only 11 monthly rates of change:
we compute the changes from January to February, from February to March,
March to April, April to May, and so forth. So $n = 12$ and there are $n - 1 = 11$
monthly percent increases.
\[ GM = \sqrt[12-1]{\frac{155}{20}} - 1 = \sqrt[11]{7.75} - 1 = 0.2046 \]
Hence, the number of badminton clubs increased at a rate of about 0.2046, or 20.46%, per
month.
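Both uses of the geometric mean can be sketched in Python. The function names `geometric_mean` and `mean_rate_of_increase` are illustrative assumptions; the second function implements the (n - 1)th root computation used in the badminton example.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values (formula 1)."""
    return math.prod(values) ** (1 / len(values))


def mean_rate_of_increase(first, last, periods):
    """Average rate of increase per period over (periods - 1) changes."""
    return (last / first) ** (1 / (periods - 1)) - 1


# Badminton clubs grew from 20 to 155 over 12 months (11 monthly changes).
rate = mean_rate_of_increase(20, 155, 12)
print(f"{rate:.4f}")

# A small check of formula (1): the geometric mean of 2 and 8.
print(geometric_mean([2, 8]))
```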
II. MEDIAN
The median is the midpoint of the data array. When the data set is
ordered, whether ascending or descending, it is called a data array. Median is
an appropriate measure of central tendency for data that are ordinal or above,
but it is more valuable in an ordinal type of data.
A. Properties of Median
1. The median is unique; there is only one median for a set of data.
2. The median is found by arranging the set of data from lowest to highest (or highest
to lowest) and getting the value of the middle observation.
3. The median is affected by the number of values.
4. The median can be applied to ordinal, interval, and ratio data.
5. The median is most appropriate for skewed data.
Step 2: Select the middle rank value using the formula: Rank Value = $\frac{n+1}{2}$
\[ \text{Median (Rank Value)} = \frac{n+1}{2} = \frac{9+1}{2} = \frac{10}{2} = 5 \]
The median is the 5th value of the ordered data.
Hence, the median age is 53 years.
Example 2: The daily rates of a sample of eight employees at GMS Inc. are
₱550, ₱420, ₱560, ₱500, ₱700, ₱670, ₱860, ₱480. Find the median daily rate of
employees.
Solution:
Step 1: Arrange the data in order.
420, 480, 500, 550, 560, 670, 700, 860
Step 2: Select the middle rank value using the formula: Rank Value = $\frac{n+1}{2}$
\[ \text{Median (Rank Value)} = \frac{n+1}{2} = \frac{8+1}{2} = \frac{9}{2} = 4.5 \]
The median is at the 4.5th position of the ordered data.
Since the middle point falls between ₱550 and ₱560, we can determine the
median of the data set by getting the average of the two values.
\[ \text{Median} = \frac{550 + 560}{2} = \frac{1{,}110}{2} = 555 \]
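The rank-value procedure can be written as a small Python function. This is a minimal sketch; `median_by_rank` is an illustrative name that follows the two steps of the solution (sort, then locate rank (n + 1)/2).

```python
def median_by_rank(values):
    """Median found through the rank value (n + 1) / 2 of the ordered data."""
    ordered = sorted(values)             # Step 1: arrange the data in order
    rank = (len(ordered) + 1) / 2        # Step 2: middle rank value (1-based)
    if rank.is_integer():
        return ordered[int(rank) - 1]    # odd n: a single middle value
    lower = ordered[int(rank) - 1]       # even n: the two values around the rank
    upper = ordered[int(rank)]
    return (lower + upper) / 2


daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]
median_rate = median_by_rank(daily_rates)
print(median_rate)  # average of the 4th and 5th ordered values
```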
III. Mode
The mode is the value in a data set that appears most frequently. Like the
median, and unlike the mean, the mode is not affected by extreme values in a
data set. A data set may not contain any mode if none of the values is "most typical".
A data set that has only one value occurring with the greatest frequency is
called unimodal. If the data set has two values with the same greatest
frequency, both values are considered modes and the data set is bimodal. If
the data set has more than two modes, then the data set is said to be
multimodal. In some cases every value in a data set occurs with the same
frequency. When this occurs, the data set is said to have no mode.
A. Properties of Mode
Example 1: The following data represent the total unit sales for smart phones from
a sample of 10 communication centers for the month of August:
15, 17, 10, 12, 13, 10, 14, 10, 8, and 9. Find the mode.
Solution: The ordered array for these data is 8, 9, 10, 10, 10, 12, 13, 14, 15, 17. Since 10
appears three times, more often than any other value, the mode is 10.
There are two modes 20 and 25, since each of these values occurs four times.
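The unimodal, bimodal, and no-mode cases can be illustrated with a short sketch. The function name `modes` is an assumption, and the second data set below is hypothetical, chosen only to show two modes; it is not the data of the module's own example.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # If several distinct values all occur equally often, there is no mode.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


# Unit sales from Example 1: 10 occurs three times.
sales = [15, 17, 10, 12, 13, 10, 14, 10, 8, 9]
print(modes(sales))

# Hypothetical data with two values sharing the greatest frequency (bimodal).
print(modes([20, 20, 25, 25, 30]))

# Every value occurs once: no mode.
print(modes([1, 2, 3]))
```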
CARRY OUT
Solve: A salesperson records the following daily expenditures during a ten-day trip:
₱233.04 ₱198.75 ₱166.85 ₱343.60 ₱201.50
₱527.92 ₱455.36 ₱354.72 ₱198.75 ₱ 769.25
1. What is the mean of the given data?
2. What is the median of the given data?
3. What data occurs most frequently?
4. Suppose the salesperson had another 2-day trip with a total expenditure of
₱345.00. What will be the new mean expenditure?
5. What will be the new median?
CHECKPOINT
Direction: Compute the mean, median, and mode of the following problems.
1. A college professor administered a unit exam to one of his classes and found that the majority
of the items were too easy. The scores are:
45, 39, 40, 48, 35, 37, 36, 37, 40, 44, 41, 49, 29, 8, 32, 36, 37, 41, 40, 36, 39, 30, 25, 43, and 50.
200, 1 200, 300, 350, 500, 550, 1 500, 400, 500, 800, 850, 1 300, 2 000, 2 100, 340, 760, 830, 2 670, 990
and 3 000.
CONTEMPLATE ON IT
List down the concepts that you have learned from this lesson.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
___________________________________________________________________
REFERENCES:
Soriano, JM. (2019). Mathematics in the Modern World. Books Atbp. Publishing Corp.
707 Tiaga cor. Kasipagan St., Barangka Drive, Mandaluyong City, Philippines.
Sirug, WS. (2015). Basic Probability and Statistics, A Step by Step Approach.
Mindshapers Co., Inc. Rm. 108, Intramuros Corporate Plaza Bldg., Recoletos St.
Intramuros, Manila, Philippines.
Learning Outcomes:
At the end of this lesson, you must have:
a) Computed the measures of variability such as range, quartile deviation, mean absolute
deviation, variance, and standard deviation of a given data set.
b) Discussed the properties of the different measures of variability.
c) Appreciated the application of the range and standard deviation in the analysis of data.
CATCH IT
CONCEPTUALIZE
Another important characteristic of a data set is how it is distributed, or how far each
element is from some measure of central tendency.
A. Range
The range is the simplest and easiest measure of dispersion to determine. It is the
difference between the highest value and the lowest value in the data set. The range has two
advantages: (i) it is easy to compute and (ii) it is easy to understand.
On the other hand, it also has two disadvantages: (i) it can be distorted by a single extreme
value (or outlier) and (ii) only two values are used in the calculation.
Properties of Range
1. It is a quick but rough measure of dispersion.
2. The larger the value of the range, the more dispersed is the observation.
3. It considers only the lowest and the highest values in the population.
Example 1: The daily rates of a sample of eight employees at GMS Inc. are
₱550, ₱420, ₱560, ₱500, ₱700, ₱670, ₱860, ₱480. Find the range.
Solution:
Step 1: Determine the highest value and the lowest value in the data set.
Highest Value (HV) = ₱860     Lowest Value (LV) = ₱420
Step 2: Solve for the range.
Range = HV − LV = ₱860 − ₱420 = ₱440
The range of the daily rates is ₱440.
One of the most widely used measures of dispersion is the standard deviation.
The more spread apart the data, the higher the deviation. The standard deviation is
calculated as the square root of the variance. In finance, the standard deviation is applied
to the annual rate of return of an investment to measure the investment's volatility. It is
also known as historical volatility and is used by investors as a gauge for the amount of
expected volatility.
Variance is a measure of the dispersion of a set of data points around their mean value.
It is the average of the squared deviations from the mean.
Properties of Variance
\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} \qquad s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}} \]
Example 2: The daily rates of a sample of eight employees at GMS Inc. are
₱550, ₱420, ₱560, ₱500, ₱700, ₱670, ₱860, ₱480. Find the variance and standard deviation.
Solution:
Step 2: Subtract the mean from each of the value in the data set.
𝑥 𝑥 − 𝑥̅
550 −42.5
420 −172.5
560 −32.5
500 −92.5
700 107.5
670 77.5
860 267.5
480 −112.5
∑ 𝑥 = 4 740 ∑(𝑥 − 𝑥̅ ) = 0
Step 3: Square the 𝑥 − 𝑥̅ , then get the sum.
Step 4: Solve for the variance and the standard deviation. We can obtain the standard deviation
by simply extracting the square root of the variance.
\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} \qquad s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}} \]
𝑥
550
420
560
500
700
670
860
480
∑ 𝑥 = 4,740
Step 2: Square the values in the data set and get the sum.
𝑥 𝑥2
550 302, 500
420 176,400
560 313,600
500 250,000
700 490,000
670 448,900
860 739,600
480 230,400
∑ 𝑥 = 4,740 ∑ 𝑥 2 = 2,951,400
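Step 3: Solve for the values of the variance and standard deviation.
\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} = \frac{2{,}951{,}400 - \frac{(4{,}740)^2}{8}}{8-1} = \frac{2{,}951{,}400 - 2{,}808{,}450}{7} = 20{,}421.43 \]
\[ s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}} = \sqrt{\frac{2{,}951{,}400 - \frac{(4{,}740)^2}{8}}{8-1}} = \sqrt{\frac{2{,}951{,}400 - 2{,}808{,}450}{7}} = \sqrt{20{,}421.43} = 142.90 \]
Thus, the variance is 20,421.43 (in squared pesos) and the standard deviation is ₱142.90.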
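The computational formula for the sample variance can be checked numerically. The sketch below assumes the eight daily rates of Example 2; the function name `sample_variance` is illustrative.

```python
import math


def sample_variance(values):
    """Sample variance via (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]

variance = sample_variance(daily_rates)
std_dev = math.sqrt(variance)  # standard deviation is the square root of the variance

print(round(variance, 2))
print(round(std_dev, 2))
```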
Example 3: The monthly incomes of the five research directors of Recoletos schools are:
₱55,000, ₱59,500, ₱62,500, ₱57,000, ₱61,000. Find the variance and standard deviation.
Solution:
Step 2: Subtract the population mean from each of the value in the data set.
𝑥 𝑥−𝜇
55,000 −4,000
59,500 500
62,500 3,500
57,000 −2,000
61,000 2,000
𝑥 𝑥−𝜇 (𝑥 − 𝜇)2
55,000 −4,000 16,000,000
59,500 500 250,000
62,500 3,500 12,250,000
57,000 −2,000 4,000,000
61,000 2,000 4,000,000
∑ 𝑥 = 295,000 ∑(𝑥 − 𝜇) = 0 ∑(𝑥 − 𝜇)2 = 36,500,000
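Step 4: Solve for the population variance and population standard deviation.
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{36{,}500{,}000}{5} = 7{,}300{,}000 \]
\[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{7{,}300{,}000} = 2{,}701.85 \]
Hence, the population variance is 7,300,000 and the population standard deviation is 2,701.85.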
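For a population, the sum of squared deviations is divided by N rather than n - 1. The following sketch recomputes Example 3 from the five monthly incomes; the function name `population_variance` is an illustrative choice.

```python
import math


def population_variance(values):
    """Average of the squared deviations from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


incomes = [55_000, 59_500, 62_500, 57_000, 61_000]

mu = sum(incomes) / len(incomes)          # population mean
squared_sum = sum((x - mu) ** 2 for x in incomes)
pop_var = population_variance(incomes)    # squared_sum divided by N = 5
pop_sd = math.sqrt(pop_var)

print(mu, squared_sum, pop_var, round(pop_sd, 2))
```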
CARRY OUT
CHECKPOINT
Direction: Read, analyze, and compute the measures of dispersion for the following data.
1. The following data give the weight (in pounds) gained by 8 employees at the end of the
Christmas gatherings. Complete the table and compute the variance and standard
deviation.
𝑥 𝑥²
11
9
10
6
5
4
8
7
𝑥 𝑥 − 𝑥̅ (𝑥 − 𝑥̅ )2
75
60
65
62
80
83
91
78
CONTEMPLATE ON IT
List down the concepts that you have learned from this lesson.
CATCH UP
CONCEPTUALIZE
When presenting or analyzing a data set, it is sometimes helpful to group subjects into
several equal groups. To create four equal groups we need the values that split the data such
that 25% of the observations are in each group. These cut-off points are called quartiles; the
general term for such cut-off points is quantiles. Deciles split the data into 10 parts, and
percentiles split the data into 100 parts.
Values such as quartiles can also be expressed as percentiles; for example, the lowest
quartile is also the 25th percentile, and the median is the 50th percentile or the 5th decile.
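A. Quartiles
\[ Q_k = \frac{k(N+1)}{4} \]
Where: $Q_k$ = position of the $k$th quartile in the ordered data
$N$ = population size (number of observations)
$k$ = quartile location (1, 2, or 3)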
Example 1: Find the first, second, and third quartiles of the ages of 9 middle-management
employees of a certain company. The ages are 53, 45, 59, 48, 54, 46, 51, 58, and 55.
Solution:
Step 1: Arrange the data in order.
45, 46, 48, 51, 53, 54, 55, 58, 59
With $N = 9$, the quartile positions are $\frac{1(9+1)}{4} = 2.5$, $\frac{2(9+1)}{4} = 5$, and
$\frac{3(9+1)}{4} = 7.5$. The 5th value is 53. Since the 2.5th position falls between 46 and 48,
and the 7.5th position falls between 55 and 58, we can determine the first and third
quartiles of the data set by getting the average of the two values.
\[ Q_1 = \frac{46 + 48}{2} = \frac{94}{2} = 47 \qquad Q_3 = \frac{55 + 58}{2} = \frac{113}{2} = 56.5 \]
Therefore, $Q_1 = 47$, $Q_2 = 53$, and $Q_3 = 56.5$.
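The quartile positions and values can be reproduced with a small function. This is a sketch under one assumption: when a position falls between two ranks it interpolates linearly, which reduces to the simple average used in the example whenever the position ends in .5. The name `quartile` is illustrative.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) located at position k(N + 1) / 4."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2, or 3")
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three observations")
    position = k * (n + 1) / 4           # 1-based position in the ordered data
    lower_rank = int(position)           # rank just at or below the position
    fraction = position - lower_rank
    if fraction == 0:
        return ordered[lower_rank - 1]
    below = ordered[lower_rank - 1]
    above = ordered[lower_rank]
    return below + fraction * (above - below)


ages = [53, 45, 59, 48, 54, 46, 51, 58, 55]
q1, q2, q3 = (quartile(ages, k) for k in (1, 2, 3))
print(q1, q2, q3)
```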
B. z-Score
A z-score is used to know the position of one observation relative to the others in a set
of data. For example, we may want to know how a student's score of 42 compares with
the scores of the other students in the class on a quiz with a total of 50 points. The mean
and the standard deviation of the scores can be used to compute the z-score, which
measures the relative standing of a measurement in a data set.
A z-score measures the distance between an observation and the mean, measured in
units of standard deviation. The following formulas show how to compute the z-score for a
data value $x$ in a population and in a sample.
\[ z = \frac{x - \mu}{\sigma} \quad \text{(for population)} \]
\[ z = \frac{x - \bar{x}}{s} \quad \text{(for sample)} \]
Example 1: The monthly expenditures of a large group of households are normally distributed
with a mean of ₱48,700 and a standard deviation of ₱10,400. What are the z-values of monthly
expenditures of ₱59,100 and ₱38,300?
Solution:
Let $\mu = 48{,}700$ and $\sigma = 10{,}400$.
Using the formula for $z$, the z-values for the two $x$ values (₱59,100 and ₱38,300) are
computed as follows:
For $x = 59{,}100$: \( z = \frac{x - \mu}{\sigma} = \frac{59{,}100 - 48{,}700}{10{,}400} = 1.00 \)
For $x = 38{,}300$: \( z = \frac{x - \mu}{\sigma} = \frac{38{,}300 - 48{,}700}{10{,}400} = -1.00 \)
The $z$ of 1.00 indicates that a monthly expenditure of ₱59,100 is one standard
deviation above the mean, and a $z$ of −1.00 shows that a monthly expenditure of ₱38,300 is one
standard deviation below the mean. Note that both household monthly expenditures
(₱59,100 and ₱38,300) are the same distance, ₱10,400, from the mean.
Example 2: A normal curve has a mean of 650 and a standard deviation of 40. An analyst is
interested in a value of 575 and wants to find its equivalent z-score.
Solution:
Given: $\bar{x} = 650$, $s = 40$, $x = 575$
Example 3: A time study report indicates that an assembly line task should be finished in an
average of 5.64 minutes, with a standard deviation of 0.97 minutes. One particular item had a
z-score of 1.53. What was the completion time of this item?
Solution:
Given: $\bar{x} = 5.64$, $s = 0.97$, $z = 1.53$
Substituting the given values to determine the $x$-value, we get
\[ z = \frac{x - \bar{x}}{s} \quad \Rightarrow \quad x = \bar{x} + zs \]
\[ x = \bar{x} + zs = 5.64 + (1.53)(0.97) = 5.64 + 1.4841 = 7.1241 \approx 7.12 \text{ mins} \]
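The z-score formula and its rearranged form can be sketched as two small functions. The names `z_score` and `value_from_z` are illustrative; the numbers come from Examples 2 and 3.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in units of standard deviation."""
    return (x - mean) / sd


def value_from_z(z, mean, sd):
    """Data value that corresponds to a given z-score: x = mean + z * sd."""
    return mean + z * sd


# Example 2: mean 650, standard deviation 40, value of interest 575.
z = z_score(575, 650, 40)
print(z)  # negative, because 575 lies below the mean

# Example 3: mean 5.64 minutes, standard deviation 0.97, z-score 1.53.
completion_time = value_from_z(1.53, 5.64, 0.97)
print(round(completion_time, 2))
```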
John Wilder Tukey (1915−2000) introduced the boxplot in the 1970s. A box plot (or
box-and-whisker plot) is a graph of a data set obtained by drawing a horizontal line from
the minimum data value to the first quartile ($Q_1$), drawing a horizontal line from the third
quartile ($Q_3$) to the maximum data value, and drawing a box whose vertical sides pass
through $Q_1$ and $Q_3$, with a vertical line inside the box passing through the median ($Q_2$).
1. If the median is near the center of the box, the distribution is approximately symmetric.
2. If the median falls to the right of the center of the box, the distribution is negatively skewed.
3. If the median falls to the left of the center of the box, the distribution is
positively skewed.
4. If the lines are about the same length, the distribution is approximately symmetric.
5. If the left line is longer than the right line, the distribution is negatively skewed.
6. If the right line is longer than the left line, the distribution is positively skewed.
[Figure: box plot on a scale from 0 to 60, marking the lowest x, Q1, Q2 = median, Q3, and the highest x]
Example 1: Construct a box-plot for the data set of the ages of 9 middle- management
employees of a certain company. The ages are 53, 45, 59, 48, 54, 46, 51, 58, 55.
Step 2: Locate the lowest value, $Q_1$, the median, $Q_3$, and the highest value on the scale.
Step 3: Draw a box around $Q_1$ and $Q_3$, draw a vertical line through the median, and connect
the upper and lower values.
[Figure: box plot on a scale from 40 to 60 with lowest value 45, Q1 = 47, median = 53, Q3 = 56.5, and highest value 59]
The data set of the distribution is negatively skewed, since the median falls to the right of the
center of the box.
CARRY OUT
CHECKPOINT
2. Fifteen randomly selected business administration students were asked to state the number of
hours they slept last Sunday. The resulting data are 4, 5, 7, 6, 7, 8, 10, 5, 4, 11, 12, 11, 10, 8,
and 7. Find the first quartile, second quartile, and third quartile.
CONTEMPLATE ON IT
CATCH IT
CONCEPTUALIZE
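[Figure: panels (c) "Sample size increased & class width decreased further" and (d) "Normal distribution for the population"]
[Figure: normal curve over z = -3 to 3 (x = μ − 3σ to μ + 3σ). The areas are 34.13% on each side of the mean within 1σ, 13.59% between 1σ and 2σ on each side, and 2.28% beyond 2σ on each side. About 68% of the area lies within 1σ of the mean, about 95% within 2σ, and about 99.7% within 3σ.]
Standard normal value: \( z = \frac{x - \mu}{\sigma} \)
Where: $z$ = z value
$x$ = the value of any particular observation or measurement
$\mu$ = the mean of the distribution
$\sigma$ = standard deviation of the distribution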
This property of the normal distribution allows us to convert a probability problem
concerning $x$ into one concerning $z$. To determine the probability that $x$ lies in a given
interval, convert the interval to the $z$ scale and then compute the probability by using the
standard normal distribution table in ( ).
Example 1: Determine the area under the standard normal distribution curve
between $z = 0$ and $z = 1.85$.
Solution: Draw the figure and represent the area as shown in the figure below.
Since Table A gives the area between 0 and any z-value to the right of 0, we
only need to look up the $z$ value in the table. Find 1.8 in the left column and
0.05 in the top row. The value where the column and row meet in the table is
the answer, 0.4678.
[Figure: standard normal curve with z = 0 and z = 1.85 marked]
Example 2: Determine the area under the standard normal distribution curve
between $z = 0$ and $z = -1.15$.
[Figure: standard normal curve with z = −1.15 and z = 0 marked]
Example 3: Find the area under the standard normal distribution curve to the
right of 𝑧 = 1.15
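The required area is at the right tail of the normal curve. Since the table gives the
area between $z = 0$ and $z = 1.15$, first find that area.
Then subtract $P(0 < z < 1.15) = 0.3749$ from 0.5000, since half of the area under
the curve is to the right of $z = 0$. The required area is 0.5000 − 0.3749 = 0.1251.
[Figure: standard normal curve with z = 0 and z = 1.15 marked; the area to the right of 1.15 is 0.1251]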
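The table look-ups in these examples can be cross-checked with `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later). This is a sketch; the helper name `area_between` is an assumption, and the cumulative distribution function is used in place of the printed table.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def area_between(z1, z2):
    """Area under the standard normal curve between two z-values."""
    return abs(std_normal.cdf(z2) - std_normal.cdf(z1))


# Example 1: area between z = 0 and z = 1.85.
print(round(area_between(0, 1.85), 4))

# Example 2: area between z = 0 and z = -1.15 (same as between 0 and 1.15).
print(round(area_between(0, -1.15), 4))

# Example 3: area to the right of z = 1.15.
right_tail = 1 - std_normal.cdf(1.15)
print(round(right_tail, 4))
```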
In conjunction with the standard normal value formula, many different types of
probability problems involving the normal distribution can be resolved. To illustrate this,
we will deal with some examples.
\[ z = \frac{x - \mu}{\sigma} \]
where: $z$ = z-value
$x$ = value of any particular observation or measurement
$\mu$ = population mean
$\sigma$ = population standard deviation
The formula is used to gain information about an individual data value when the variable is
normally distributed.
Example 1: The average Pag-ibig Salary Loan for RFS Pharmacy Inc. employees is ₱23,000. If
the debt is normally distributed with a standard deviation of ₱2,500, find the probability that
an employee owes less than ₱18,500.
Solution:
Step 1: Draw a figure and represent the area.
$P(x < 18{,}500)$
Step 2: Find the $z$ value for $x = 18{,}500$.
\[ z = \frac{x - \mu}{\sigma} = \frac{18{,}500 - 23{,}000}{2{,}500} = -1.80 \]
Step 3: Find the appropriate area. The area obtained from the standard normal distribution
table is 0.4641, which corresponds to the area between $z = 0$ and $z = -1.80$.
\[ P(x < 18{,}500) = P(z < -1.80) = 0.5000 - P(-1.80 < z < 0) = 0.5000 - 0.4641 = 0.0359 \]
[Figure: normal curve with x = 18,500 and the mean 23,000 marked; the area to the left of 18,500 is 0.0359]
Hence, the probability that the employee owes less than ₱18,500 in Pag-ibig salary loan is
0.0359 or 3.59%.
Example 2: The average age of bank managers is 40 years. Assume the variable is normally
distributed. If the standard deviation is 5 years, find the probability that the age of a randomly
selected bank manager will be in the range between 35 and 46 years old.
Solution: Assume that the ages of bank managers are normally distributed; the cut-off points
are shown in the figure below.
[Figure: normal curve with the x-values 35, 40, and 46 marked]
Step 4: Add $P(-1.00 < z < 0)$ and $P(0 < z < 1.20)$.
\[ P(35 < x < 46) = P(-1.00 < z < 1.20) = P(-1.00 < z < 0) + P(0 < z < 1.20) = 0.3413 + 0.3849 = 0.7262 \]
Learning Module in GE 3(Mathematics in the Modern World): A Simplified Approach Page
34
[Figure: normal curve with the x-values 35, 40, and 46 and the corresponding z-values −1.00, 0, and 1.20]
Hence, the probability that a randomly selected bank manager is between 35 and 46
years old is 0.7262 or 72.62%.
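Both applied examples can be checked without converting to z by hand, because `statistics.NormalDist` accepts the mean and standard deviation directly. This is a sketch only; note that the exact value for the second example may differ from the table-based 0.7262 in the fourth decimal place, since the printed table entries are themselves rounded.

```python
from statistics import NormalDist

# Example 1: loans with mean 23,000 and standard deviation 2,500.
loans = NormalDist(mu=23_000, sigma=2_500)
p_less = loans.cdf(18_500)  # P(x < 18,500)
print(round(p_less, 4))

# Example 2: ages with mean 40 and standard deviation 5.
ages = NormalDist(mu=40, sigma=5)
p_between = ages.cdf(46) - ages.cdf(35)  # P(35 < x < 46)
print(f"{p_between:.3f}")

# The z-values used in the text can be recovered from the x-values.
z_low = (35 - ages.mean) / ages.stdev
z_high = (46 - ages.mean) / ages.stdev
print(z_low, z_high)
```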
CARRY OUT
1. A company produces different types of energy drinks. The filling machines are
adjusted to pour 500 ml of energy drink into each plastic bottle. Nonetheless, the actual
amount of energy drink poured into each bottle is not exactly 500 ml; it varies from bottle to
bottle. It has been observed that the amount of energy drink in a bottle is normally distributed
with a mean of 500 ml and a standard deviation of 4.75 ml. What percentage of the energy drink
bottles contain 505 to 513 ml?
CHECKPOINT
Direction: Read, analyze, and solve the following problems involving normal
distribution.
1. In a population of high school students' algebra scores, $\mu = 65$ and $\sigma = 6$. Find the
z-value that corresponds to a score of $x = 80$.
3. In a certain university, the students were informed that they need a grade in the top
8% of the Engineering students to get a scholarship for the next semester.
In a standardization of the test, the mean was 77 and the standard
deviation was 14. Assuming that the grade is normally distributed, what would
be the minimum grade needed to obtain the scholarship grant?
CONTEMPLATE ON IT
What have you learned from these lessons?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
_______________________________.
Learning Outcomes:
CATCH IT
CONCEPTUALIZE
A scatter diagram is a useful tool for checking the assumptions in a regression analysis. It
can be viewed during an initial screening run of the analysis or after the analysis. The benefit
of looking at scatter diagrams of the residuals in the beginning stages of an analysis is that it
may save the researcher time. If the assumptions are not met, further screening must be applied
before the analysis can be completed, and the data may require cleansing and transformation. In
this way, the researcher does not run the analysis haphazardly. If the assumptions are met, the
regression is ready to be run, and the researcher has increased confidence that the chances of
making a Type I or Type II error are reduced, ultimately improving the accuracy of any research
results.
The correlation coefficient is defined as the covariance divided by the product of the standard
deviations of the two variables. The following formula is used to calculate the Pearson $r$
correlation.
[Figure: scatter diagrams of Y Variable against X Variable, including panels labeled Positive Correlation (r = 0.80) and Negative Correlation (r = −0.80)]
A test of significance for the coefficient of correlation may be used to find out if the
computed Pearson′ s 𝑟 could have occurred in a population in which the two variables are
related or not. The test statistics follows the 𝑡 distribution with 𝑛 − 2 degrees of
freedom. The significance is computed using the formula:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}$$
Where: 𝑡 = t-test statistic for the correlation coefficient
𝑟 = correlation coefficient
𝑛 = number of paired samples
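As a minimal sketch of the test statistic above, the function below evaluates 𝑡 for a given 𝑟 and 𝑛. The function name and the sample inputs (𝑟 = 0.6, 𝑛 = 18) are hypothetical, chosen only so that the arithmetic is easy to follow by hand.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """Return t = r * sqrt(n - 2) / sqrt(1 - r**2) for a sample correlation r."""
    if n <= 2:
        raise ValueError("need more than 2 paired observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# Hypothetical inputs: sqrt(18 - 2) = 4 and sqrt(1 - 0.36) = 0.8,
# so t = 0.6 * 4 / 0.8 = 3.0 with 18 - 2 = 16 degrees of freedom.
t_value = correlation_t_statistic(0.6, 18)
degrees_of_freedom = 18 - 2
print(round(t_value, 4), degrees_of_freedom)  # 3.0 16
```

The computed value is then compared with the tabular 𝑡 value for 𝑛 − 2 degrees of freedom at the chosen level of significance.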
Possible relationships between two correlated variables:
1. There is a direct cause-and-effect relationship between the two variables.
2. There is a reverse cause-and-effect relationship between the two variables.
3. The relationship between the two variables may be caused by a third variable.
4. There may be a complexity of interrelationships among many variables.
5. The relationship between the two variables may be coincidental.
Example 1: The owner of a chain of fruit shake stores would like to study the correlation
between atmospheric temperature and sales during the summer season. A random sample of 12
days is selected with the results given as follows:
Day                     1    2    3    4    5    6    7    8    9   10   11   12
Temperature (℉)        79   76   78   84   90   83   93   94   97   85   88   82
Total Sales (units)   147  143  147  168  206  155  192  211  209  187  200  150
Plot the data on a scatter diagram. Does it appear there is a relationship between
atmospheric temperature and sales? Compute the coefficient of correlation. Determine at the
0.05 significance level whether the correlation in the population is greater than zero.
Solution:
[Figure: scatter diagram of Sales (Y) against Temperature (X).]
𝑑𝑓 = 𝑛 − 2 = 12 − 2 = 10 and the tabular value is 𝑡 = ±2.228
Step 5: Compute for the value of 𝑟 (Pearson Product-Moment Correlation Coefficient)
Day           𝑥        𝑦       𝑥²       𝑦²       𝑥𝑦
1            79      147    6,241   21,609   11,613
2            76      143    5,776   20,449   10,868
3            78      147    6,084   21,609   11,466
4            84      168    7,056   28,224   14,112
5            90      206    8,100   42,436   18,540
6            83      155    6,889   24,025   12,865
7            93      192    8,649   36,864   17,856
8            94      211    8,836   44,521   19,834
9            97      209    9,409   43,681   20,273
10           85      187    7,225   34,969   15,895
11           88      200    7,744   40,000   17,600
12           82      150    6,724   22,500   12,300
Total     1,029    2,115   88,733  380,887  183,222
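$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n(\sum x^{2}) - (\sum x)^{2}][n(\sum y^{2}) - (\sum y)^{2}]}}$$
$$r = \frac{12(183{,}222) - (1{,}029)(2{,}115)}{\sqrt{[12(88{,}733) - (1{,}029)^{2}][12(380{,}887) - (2{,}115)^{2}]}}$$
$$= \frac{22{,}329}{\sqrt{[5{,}955][97{,}419]}} = 0.9270572554 \approx 0.93$$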
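The hand computation above can be cross-checked with a short script. This is only a sketch using the same computational formula and the Example 1 data; the function name `pearson_r` is an illustrative choice, not part of the module.

```python
import math

# Data from Example 1: temperature (x) and total sales (y) for 12 days.
temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]
sales = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]

def pearson_r(x, y):
    """Pearson r from the raw sums, mirroring the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(temperature, sales)
print(round(r, 2))  # 0.93
```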
Since the computed 𝑡-value of 8.00 is greater than the tabular value of 2.228 at the
0.05 level of significance, we reject the null hypothesis.
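The computed 𝑡-value of 8.00 can be checked by substituting 𝑛 = 12 and the correlation coefficient into the test-statistic formula given earlier. The worked step below is a reconstruction and assumes that the rounded value 𝑟 = 0.93 was used:

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{0.93\sqrt{12-2}}{\sqrt{1-(0.93)^{2}}}
  = \frac{2.9409}{0.3676}
  \approx 8.00
```

With the unrounded 𝑟 ≈ 0.9271 the statistic is about 7.82, which still exceeds the tabular value of 2.228, so the decision is the same.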
Step 7: Conclusion:
Since the null hypothesis has been rejected, we can conclude that there is evidence
of a significant association between atmospheric temperature and the total
sales of fruit shake.
A simple linear regression is the least squares estimator of a linear regression model with a
single predictor (or one independent variable). The least squares method determines a
regression equation by minimizing the sum of squares of the vertical distances between the
actual 𝑦 values and the predicted values of 𝑦. That is, simple regression fits a straight line
through the set of 𝑛 points in such a way that the sum of squared residuals of the model is
as small as possible. This method gives what is generally known as the "best-fitting" line. The
difference between an observed value and a predicted value is called the residual. The mean of
the residuals is always zero. Points that fall outside the overall pattern of the points are known
as outliers.
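The claim that the residuals of a least squares line average to zero can be illustrated numerically. The sketch below fits the line to the Example 1 data using the deviation form of the least squares formulas (an assumed but standard formulation) and then computes the residuals; the variable names are illustrative.

```python
# Data from Example 1: temperature (x) and total sales (y) for 12 days.
x = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]
y = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and intercept in deviation form.
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Residual = observed y minus predicted y.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
mean_residual = sum(residuals) / n
sum_squared_residuals = sum(e ** 2 for e in residuals)

# The mean is zero up to floating-point rounding error.
print(abs(mean_residual) < 1e-9)  # True
```

Any other straight line through these points would give a larger `sum_squared_residuals`, which is what "best-fitting" means here.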
In a scatter plot, scores whose removal greatly changes the regression line are called
influential scores. In some cases, these scores are restricted to points with
extreme 𝑥-values. Some influential scores may have a small residual but still have a greater
effect on the regression line than scores with possibly larger residuals but average 𝑥-values.
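One way to see influence in practice is to refit the line with each point left out in turn and compare slopes. The sketch below does this on a small hypothetical data set (not from the module): four points on the line 𝑦 = 𝑥 plus one point with an extreme 𝑥-value. The leave-one-out approach and the helper name `fit_line` are illustrative assumptions.

```python
def fit_line(points):
    """Least squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    s_xy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    s_xx = sum((px - mean_x) ** 2 for px, _ in points)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x

# Hypothetical data: the last point has an extreme x-value.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (10, 4)]
full_slope, full_intercept = fit_line(points)

# Refit with each point left out and record how far the slope moves.
slope_changes = []
for i in range(len(points)):
    reduced = points[:i] + points[i + 1:]
    reduced_slope, _ = fit_line(reduced)
    slope_changes.append(abs(reduced_slope - full_slope))

most_influential = max(range(len(points)), key=lambda i: slope_changes[i])
residuals = [py - (full_intercept + full_slope * px) for px, py in points]
print(points[most_influential])  # (10, 4)
```

In this data set the point (10, 4) moves the slope the most when removed, even though its residual is smaller in magnitude than that of the point (4, 4), which has an average 𝑥-value.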
Solution:
Computation of the Simple Linear Regression Equation
Step 1: Obtain the sums of 𝑥, 𝑦, 𝑥², 𝑦², and 𝑥𝑦.
∑𝑥 = 1,029    ∑𝑥² = 88,733    ∑𝑥𝑦 = 183,222
∑𝑦 = 2,115    ∑𝑦² = 380,887
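From these sums, the slope and intercept can be obtained with the usual least squares formulas. The sketch below assumes those standard formulas, since the intermediate steps are not shown here; the variable names are illustrative.

```python
# Sums from Step 1 (n = 12 paired observations).
n = 12
sum_x, sum_y = 1029, 2115
sum_x2, sum_xy = 88733, 183222

# Standard least squares formulas (assumed here):
#   b1 = (n*sum_xy - sum_x*sum_y) / (n*sum_x2 - sum_x**2)
#   b0 = mean(y) - b1 * mean(x)
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * (sum_x / n)

# Rounding the slope to four decimals before computing the intercept
# changes the intercept slightly in the third decimal place.
b1_rounded = round(b1, 4)
b0_from_rounded_slope = sum_y / n - b1_rounded * (sum_x / n)

print(b1_rounded)                       # 3.7496
print(round(b0, 2))                     # -145.28
print(round(b0_from_rounded_slope, 4))  # -145.2782
```

The last value matches the intercept of −145.2782 interpreted in the text, which suggests the intercept was computed from the rounded slope.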
[Figure: scatter diagram of Sales (Y) against Temperature (X) with the fitted regression line
𝑦 = 3.7496𝑥 − 145.2782.]
Thus, the regression equation is 𝑦̂ = 3.7496𝑥 − 145.2782. The 𝑏1 of 3.7496 indicates that for
each additional degree Fahrenheit of temperature, sales are expected to increase by 3.7496 units.
The 𝑏0 value of −145.2782 indicates that the intercept with the 𝑦-axis is below the origin. A
literal interpretation is that if the temperature were zero degrees Fahrenheit, negative 145.2782
units would be sold, which has no practical meaning because 0 ℉ lies far outside the observed
range of 76 ℉ to 97 ℉.
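The regression equation can also be used for prediction. The sketch below uses the slope 3.7496 and intercept −145.2782 discussed above; the function names and the range check are illustrative additions meant to flag predictions outside the sampled temperatures of 76 ℉ to 97 ℉.

```python
# Coefficients as discussed in the text (slope b1 and intercept b0).
B1 = 3.7496
B0 = -145.2782

# Observed temperature range in the sample (degrees Fahrenheit).
MIN_TEMP, MAX_TEMP = 76, 97

def predict_sales(temp_f: float) -> float:
    """Predicted sales (units) from the fitted line y_hat = B0 + B1 * temp_f."""
    return B0 + B1 * temp_f

def is_extrapolation(temp_f: float) -> bool:
    """True when the temperature lies outside the observed sample range."""
    return not (MIN_TEMP <= temp_f <= MAX_TEMP)

print(round(predict_sales(90), 2), is_extrapolation(90))  # 192.19 False
print(round(predict_sales(0), 4), is_extrapolation(0))    # -145.2782 True
```

Predictions inside the observed range are the ones the fitted line supports; the value at 0 ℉ is an extrapolation and should not be read as a sales forecast.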
CARRY OUT
Use the problem below to practice solving problems about correlation and regression.
1. A random sample of nine (9) cities gave the following figures for annual per capita cigarette
consumption and annual death rate from lung cancer.
City                          1    2    3    4    5    6    7    8    9
Cigarette Consumption (x)   350  370  250  260  255  300  400  330  240
Death Rate (y)               21   24   17   18   17   19   25   20   16
a. Calculate the sample correlation 𝑟. At 0.01 level of significance, test whether cigarette
consumption and lung cancer are unrelated.
b. Determine the regression line.
CHECKPOINT
2. The city engineer wants to establish the relationship between household size and monthly
household water consumption. Given the data in the table, do the following:
a. Plot the data in a scatter plot and find the correlation 𝑟.
b. Determine whether we can conclude from these data that the two variables are linearly
related at the 0.05 level of significance.
c. Find the regression line.
CONTEMPLATE ON IT
What have you learned from these lessons?
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________