Module 3 Describing Data
Introduction
Learning Outcomes
Summarize data using measures of central tendency such as mean, weighted mean, median, and
mode;
Describe data using measures of variation such as range, variance, standard deviation, and
coefficient of variation;
Identify the position of a data value in a data set using various measures of position such as
standard scores, percentiles and quartiles;
Use the techniques of exploratory data analysis, including boxplots and five-number summaries
to discover various aspects of data.
Advocate the use of statistical data in making important decisions.
Learning Content
Suppose that you are browsing the net to look for information on internet providers in your place. To
make an informed decision regarding your planned subscription, you decide to collect as much
information as possible such as availability, monthly payment, internet speed, volume allowance, and
installation fee. What information is important in helping you make this decision?
Information from the net is sometimes presented to us in a frequency distribution. When we look at a
distribution of data, we should consider the three characteristics of the distribution: shape, center, and
spread.
Graphs are useful for the visual description of a data set; however, they are not always the best tool
when you want to make inferences or make decisions using information based on a sample. Hence, it
is better to use numerical measures such as center, relative position, and spread to describe the data.
There is a natural tendency for data to group around a central point. Any data set can be characterized
by measuring its central tendency. For teachers, one of the most useful statistics is the center point of
the data. Knowing the center point answers questions such as, “what is the average score?” or “who
scored below average?”
Definition: The measure of central tendency is a single score/value that indicates the center of a
distribution or data set.
Measures of Central Tendency are descriptive measures that are used to indicate where the center, the
middle property, or the most typical value of a set of data lies. There are three fundamental statistics
that measure the central tendency of data: the mean, median, and mode. All three provide insights into
“the center” of a distribution of data points.
A. The Mean
The mean is the most commonly used measure of center. It is the measure of central tendency that you
are most familiar with.
Definition: The mean, or arithmetic mean, of a set of data is equal to the sum of all the values
divided by the total number of data values.
The symbol $\bar{x}$, called “x bar”, is used to denote the mean of the sample and the symbol $\mu$, called
“mu”, is used to represent the mean of the population.
Properties of Mean
A data set has one mean.
Easy to calculate and all scores/values in a data set are included in computing the mean.
The mean is the balance point in any shaped distribution.
Mean is only applicable to interval and ratio scales of data.
Mean is sensitive to the size (or weight) of each score and is affected by outliers.
Solution:
a. $\bar{x} = \frac{\sum x}{n} = \frac{9.67 + 20.92 + 7.11 + 18.93 + 9.64}{5} = \frac{66.27}{5} \approx 13.25$ Mbps
b. The given information is not enough since you have to consider other factors such as
availability, upload speed, volume allowance, installation fee, and early termination fee
to decide on your internet subscription.
Example 2. A sample of n = 7 scores has a mean of 12. If a new person with a score of 10 is added to
the sample, what is the value for the new sample mean?
Solution:
$\bar{x} = \frac{\sum x}{n}$; with n = 7 and $\bar{x}$ = 12, $\sum x = 7 \times 12 = 84$.
The new sample has n = 8 and $\sum x$ = 84 + 10 = 94. Thus, $\bar{x} = \frac{94}{8} = 11.75$.
Note: Some people refer to the mean as the “average”. In fact, there are many kinds of average;
the mean is just one of them.
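The two examples above can be checked with a short Python sketch. The speeds are the five ISP download speeds quoted for Example 1 later in this module; the function and variable names are illustrative choices, not part of the module.

```python
# Download speeds (Mbps) of the five ISPs from Example 1.
speeds = [9.67, 20.92, 7.11, 18.93, 9.64]


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


speed_mean = mean(speeds)
print(round(speed_mean, 2))  # mean download speed in Mbps

# Example 2: only n and the mean are known, so recover the sum first.
n, old_mean, new_score = 7, 12, 10
old_sum = n * old_mean                      # 7 * 12 = 84
new_mean = (old_sum + new_score) / (n + 1)  # 94 / 8
print(new_mean)
```

Recovering the sum from n and the mean is the key step in Example 2: a mean cannot be updated without knowing how many values it summarizes.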
Weighted Mean
As a student, are you aware of how your instructors compute your final grade, or of how course units
affect the computation of your “General Weighted Average (GWA)” at the end of the semester?
A weighted mean is typically used to compute your academic performance. You can find your
GWA in the Certificate of Grades (CoG) issued by the Office of the Registrar at the end of every
semester. If you did poorly in some of your courses with higher units, then your GWA will be greatly
affected.
Definition: A weighted mean is the average of all the entries with varying weights in a given data
set. The weighted mean is found by multiplying each value by its corresponding
weight and dividing the total by the sum of all the weights.
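As a rough illustration of the definition, the sketch below computes a GWA in Python. The grades and units are hypothetical values made up for this illustration, and `weighted_mean` is an illustrative helper name.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical grades and course units, for illustration only.
grades = [1.50, 2.00, 1.25, 1.75]
units = [3, 3, 5, 2]

gwa = weighted_mean(grades, units)
print(round(gwa, 2))
```

Because the function divides by the sum of the weights, it works whether the weights are course units, percentages, or proportions that do not add up to exactly 1.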
Example 3. In Ms. A’s Statistics class, your grade is determined from the following sources: 25%
each from prelim, midterm, and final examinations; 10% from your quizzes; 10% from
your problem sets; and 10% from your participation. Your scores are 85 (prelim), 89
(midterm), 82 (finals), 90 (quizzes), 95 (problem sets), and 98 (participation). What is
your grade in Statistics?
Solution:
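Source          Weight (w)   Score (x)   w · x
Prelim          0.25         85          21.25
Midterm         0.25         89          22.25
Final           0.25         82          20.50
Quizzes         0.10         90           9.00
Problem Sets    0.10         95           9.50
Participation   0.10         98           9.80
Total           ∑w = 1.05                ∑wx = 92.30
$\bar{x} = \frac{\sum wx}{\sum w} = \frac{92.30}{1.05} \approx 87.90$
The weights as given add up to 1.05 rather than 1.00, so the weighted sum of 92.30 must still be
divided by ∑w to obtain the grade of about 87.90.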
When data are presented in a frequency distribution, the raw data are usually not available.
The mean is therefore approximated by using the midpoint of each class in place of the raw values
in that class.
Definition: The mean of a frequency distribution for a sample is approximated by $\bar{x} = \frac{\sum f x}{n}$,
where x and f are the midpoint and frequency of each class in the data set and $n = \sum f$.
Example 4. The frequency distribution shows the salaries (in thousand pesos) for a specific year of
the faculty in a state university. Find the mean.
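Table 2
Salary (in thousand pesos)   Midpoint (x)   Frequency (f)
21 - 30                      25.5           10
31 - 40                      35.5           15
41 - 50                      45.5           12
51 - 60                      55.5            6
61 - 70                      65.5            3
71 - 80                      75.5            2
81 - 90                      85.5            1
91 - 100                     95.5            1
Solution:
$\bar{x} = \frac{\sum f x}{\sum f} = \frac{2195}{50}$ = ₱43.9K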
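The grouped-mean computation for Example 4 can be sketched in Python as follows. One assumption is made here: the first class is taken as 21 - 30 so that every class has the same width, which is what reproduces the stated mean of ₱43.9K. The variable names are illustrative.

```python
# (lower limit, upper limit) of each salary class, in thousand pesos.
# Assumption: the first class is taken as 21 - 30 so all classes have width 10.
classes = [(21, 30), (31, 40), (41, 50), (51, 60),
           (61, 70), (71, 80), (81, 90), (91, 100)]
frequencies = [10, 15, 12, 6, 3, 2, 1, 1]

# The class midpoint stands in for every raw value inside the class.
midpoints = [(low + high) / 2 for low, high in classes]

n = sum(frequencies)                                         # total number of faculty
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of f * x
grouped_mean = sum_fx / n

print(n, sum_fx, round(grouped_mean, 1))
```

The result is an approximation, since the midpoints replace the individual salaries that the frequency table no longer shows.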
B. The Median
The word “median” is synonymous with the word “middle”. The median is the middle value in a
distribution: half of the values in the set fall at or above the median and the other half fall at or
below it. The median is generally referred to as a positional average.
Definition: The median of the data set is the value that lies in the middle of the set when data is
arranged in ascending or descending order.
The median of the sample is sometimes denoted by $\tilde{x}$, called “x-tilde”, or by M, MD, or Med; there is no
commonly accepted notation, and there is no special symbol for the median of the population.
Properties of Median
The median is unique.
It can be used for ordinal, interval and ratio scales of data.
Median is less affected by outliers, and is most appropriate in a skewed data set.
Note: An outlier is a value that is either much higher or much lower than most of the other values in the data set.
Solution:
a. Arrange the values in ascending order: 24k 29k 32k 38k 46k 66k 75k 85k 96k
Find n: n = 9
Determine the median: Since n = 9 is odd, the median is the $\frac{n+1}{2}$th value. $\frac{9+1}{2} = 5$; thus, the
median is the 5th value: $\tilde{x}$ = ₱46K
b. The mean of the data set is about 54.6k. However, looking at the raw data, more of the
faculty (five of the nine) have salaries below that figure. Therefore, the median is the more
appropriate measure of central tendency for this data set.
Example 6. The data show the number of tablet sales in millions of units for a 6-year period. Find the
median of the data.
12.6 124.5 72.4 108.2 159.8 63.4
Solution:
Arrange the values in ascending order: 12.6 63.4 72.4 108.2 124.5 159.8
Find n: n = 6
Determine the median: Since n = 6 is even, the median is the average of the two middle values,
the 3rd and the 4th. So, the median is $\frac{72.4 + 108.2}{2}$ = 90.3. The median of tablet sales is 90.3M.
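The odd and even cases in Examples 5 and 6 can be sketched in Python. The `median` helper below is an illustrative implementation written for this module; Python's standard `statistics.median` follows the same rule and is used as a cross-check.

```python
import statistics

salaries = [24, 29, 32, 38, 46, 66, 75, 85, 96]        # Example 5, in thousand pesos
tablet_sales = [12.6, 124.5, 72.4, 108.2, 159.8, 63.4]  # Example 6, in millions of units


def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                       # odd n: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair


print(median(salaries))
print(round(median(tablet_sales), 1))
print(statistics.median(salaries))       # cross-check with the standard library
```

Sorting comes first in both cases; taking the middle of the unsorted list is the most common mistake when finding a median by hand.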
C. The Mode
The mode is the value in the data set that appears with the highest frequency. It is considered an
inspection average. A data set can be unimodal (one mode), bimodal (two modes), or multimodal (more than
two modes). If all the values in the data set occur with the same frequency, then the set has no mode.
Definition: The mode of a data set is a value that occurs most frequently.
Sometimes mode is considered as the most popular option. In dealing with categorical data, mode is
usually used to find out the most common category.
Properties of Mode
The easiest average to find.
It can be used for nominal, ordinal, interval and ratio scales of data.
Mode is not affected by the size/weight of scores or by outliers.
Example 7. The mobile data download speed (in Mbps) for a particular day is listed. Find the mode.
5.5 6.2 4.5 5.1 7.4 6.2 3.5 6.2
Solution:
The mode is 6.2 Mbps because it is the download speed occurring most often (3 times)
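A frequency count makes the mode easy to find programmatically. The sketch below uses Python's `collections.Counter`; the `modes` helper is an illustrative name, and it follows this module's convention that a data set whose values all occur equally often has no mode.

```python
from collections import Counter


def modes(values):
    """Return the list of most frequent values, or [] if there is no mode."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    # Several distinct values that all occur equally often: no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == top]


speeds = [5.5, 6.2, 4.5, 5.1, 7.4, 6.2, 3.5, 6.2]  # Example 7, in Mbps
print(modes(speeds))
```

Returning a list rather than a single value covers the bimodal and multimodal cases as well; the same function works on categorical data such as car manufacturers.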
Example 8. Following is a list of the manufacturers of all cars available on the Grab transport app on a
particular day. Which car manufacturer is the mode?
Solution: The most frequent category is “Toyota,” which appears seven times. Therefore,
the mode is “Toyota”.
Note: Not all measures of central tendency can be used for all scales of measurement. For
nominal data (such as sex or race), the mode is the only valid measure. For ordinal data
(such as salary categories), only the median and the mode can be used. For interval and ratio data
without outliers, the mean is used; for interval and ratio data with outliers, the median
is the most appropriate.
STATISTech
A. Using MS Excel
Open an Excel worksheet and enter the label of the data in A1. Starting at A2, enter the data down to
cell An, where An is the cell holding the last data value. In a blank cell, key in =average(A2:An) to
compute the mean of the data set. In another blank cell, key in =median(A2:An) to compute the median
of the data set. To compute the mode, type =mode(A2:An) in another blank cell.
B. Using SPSS
When instructors return your exam papers after checking and recording the scores, you often discuss
and compare your scores with friends and classmates. Did you notice the difference in your scores
from one test to another?
The word “vary” is synonymous with the word “differ”. In statistics, variability refers to how spread out
or dispersed the values in a distribution are. Statisticians use measures of variation in addition to
measures of central tendency to describe a data set accurately.
Definition: The measure of variability is a number that represents a set of data based on how the
values differ or vary from each other.
The measure of variability, also known as the measure of dispersion or spread, tells how varied the
values in a data set are, or how much distance to expect between a given score/value and the mean. To
show the variability or spread of the values in a data set, three measures are commonly used: range,
variance, and standard deviation.
A. The Range
One way to describe the difference between the values in a data set is to compare the highest and the
lowest value. The range is the difference between the highest and the lowest value in a data set and is
considered the simplest measure of spread.
Definition: The range (R) is the mathematical difference between the highest and the lowest value
in a set of data.
Properties of Range
The range is easy to compute.
It is very sensitive to extreme values.
Does not truly reflect the difference among all of the data values in a set.
Example 9. The data show the 2019 Metro Manila Film Festival (MMFF) earnings (in millions of
pesos) in ticket sales from five out of eight entries. Find the range.
320 72 90 412 18
Example 10. The following are the scores of 10 students in a 100-item Statistics exam: 82 77 90
84 68 88 62 94 71 67. Find the range.
Solution: R = HV – LV = 94 – 62 = 32, where HV and LV are the highest and lowest values.
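The range of both data sets above takes one line each in Python. This is a minimal sketch; `data_range` is an illustrative name.

```python
def data_range(values):
    """Range R: highest value minus lowest value."""
    return max(values) - min(values)


mmff_earnings = [320, 72, 90, 412, 18]                  # Example 9, in millions of pesos
exam_scores = [82, 77, 90, 84, 68, 88, 62, 94, 71, 67]  # Example 10

print(data_range(mmff_earnings))  # 412 - 18
print(data_range(exam_scores))    # 94 - 62
```

Only two values enter the computation, which is why the range is so sensitive to extreme values.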
B. The Variance
In a statistics class, some students compute the variance first so that, when they need the standard
deviation, they simply take the square root of the variance. Because of this, some students
assume that the only use of variance is to make the computation of the standard deviation faster.
What is variance?
Unlike the range, variance combines all values in a data set to produce a good measure of variability.
It is a measure that describes the spread of the data values from the mean and how each value relates
to each other. The difference between each value in a data set and the mean is called deviation.
In any data set, the sum of the positive and negative deviations from the mean is always equal to zero.
To resolve this problem on the “balancing act” of positive and negative deviations, square each
deviation from the mean, hence, the variance.
Definition: The variance of the set of data is the mean of the squared deviations of the values
from the mean
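How to find the variance?
1. Find the mean of the values in a data set. Sample: $\bar{x} = \frac{\sum x}{n}$ Population: $\mu = \frac{\sum x}{N}$
2. Subtract the mean from each value to get the deviations. Sample: $x - \bar{x}$ Population: $x - \mu$
3. Square each deviation. Sample: $(x - \bar{x})^2$ Population: $(x - \mu)^2$
4. Add the squared deviations. Sample: $\sum (x - \bar{x})^2$ Population: $\sum (x - \mu)^2$
5. Divide the sum in step 4 by n – 1 for a sample or by N for a population. Sample: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ Population: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$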
Properties of Variance
The variance is always non-negative.
It is sensitive to extreme values.
The units of variance are the squared units of the original data value.
Example 11. Using the data in example #9, find the variance.
320 72 90 412 18
Solution:
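The calculations are shown in the table below.
Step 1. Find the mean: $\bar{x} = \frac{\sum x}{n} = \frac{912}{5}$ = 182.4 M
Step 2. Subtract the mean from each value as shown in the 2nd column.
Step 3. Square the deviations from the mean as shown in the 3rd column.
x      $x - \bar{x}$                 $(x - \bar{x})^2$
320    (320 – 182.4) = 137.6     18,933.76
72     (72 – 182.4) = -110.4     12,188.16
90     (90 – 182.4) = -92.4       8,537.76
412    (412 – 182.4) = 229.6     52,716.16
18     (18 – 182.4) = -164.4     27,027.36
Step 4. Add the squared deviations: $\sum (x - \bar{x})^2$ = 119,403.2
Step 5. Divide the sum obtained in step 4 by n – 1 to obtain the sample variance:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{119,403.2}{4}$ = 29,850.8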
Since variance is measured in squared units, the value 29,850.8 cannot be directly
related to the values in the data set. Hence calculating the standard deviation is
necessary.
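The steps of Example 11 can be traced in Python as a check on the hand computation. The `sample_variance` helper is an illustrative implementation that divides by n – 1; the standard library's `statistics.variance` (sample) and `statistics.pvariance` (population) are shown for comparison.

```python
import statistics

earnings = [320, 72, 90, 412, 18]  # MMFF earnings, in millions of pesos


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n                                # step 1
    squared_devs = [(x - mean) ** 2 for x in values]      # steps 2 and 3
    return sum(squared_devs) / (n - 1)                    # steps 4 and 5


s2 = sample_variance(earnings)
print(round(s2, 1))
print(statistics.variance(earnings))   # sample variance, divides by n - 1
print(statistics.pvariance(earnings))  # population variance, divides by N
```

The population version divides the same sum of squares by N = 5 instead of n – 1 = 4, so it is always the smaller of the two.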
Note: The units of measure in variance are squared values, making it difficult to interpret
variance. However, variance is important for conducting statistical inferences. When
comparing means, the hypothesis test to be used depends on whether or not the sample variances
are the same. To check for homogeneity of variances, you can use tests such as F-test, Bartlett’s test,
Levene’s test, and Tukey test.
In most situations, it is better to use a measure of dispersion that has the same units as the data.
Taking the square root of the variance gives the standard deviation, which returns the measure of
dispersion to its original units.
The standard deviation describes how far, on average, each observation is from the typical data value.
It is based on the deviations about the mean.
Definition: The standard deviation is the positive square root of variance. It represents the
average deviation of a data value from the mean.
In statistical practice, standard deviation is the measure of spread commonly used in conjunction with
the mean. Accordingly, it measures spread around the mean. The more widely spread the values are,
the larger the standard deviation is.
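How to find the standard deviation?
1. Find the mean of the values in a data set. Sample: $\bar{x} = \frac{\sum x}{n}$ Population: $\mu = \frac{\sum x}{N}$
2. Subtract the mean from each value to get the deviations. Sample: $x - \bar{x}$ Population: $x - \mu$
3. Square each deviation. Sample: $(x - \bar{x})^2$ Population: $(x - \mu)^2$
4. Add the squared deviations. Sample: $\sum (x - \bar{x})^2$ Population: $\sum (x - \mu)^2$
5. Divide the sum in step 4 by n – 1 for a sample or by N for a population. Sample: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ Population: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
6. Take the square root of the answer in step 5. Sample: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ Population: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$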
Example 12. Using the answer in example # 11: a) find the standard deviation, and b) interpret the
result in the context of the data.
Solution:
a. The computed variance is 29,850.8. To get the standard deviation, take the square
root of the variance, that is, $s = \sqrt{29,850.8} \approx 172.77$ M.
b. The standard deviation s = 172.77M is considered large, which means that there
is a large difference among the individual earnings of the five entries to the MMFF.
Example 13. The different ISPs’ internet download speeds in Mbps for a particular day (presented in
example 1) were: 9.67, 20.92, 7.11, 18.93, and 9.64. Find the standard deviation of the
internet download speeds and interpret the result in the context of the problem.
Solution:
Step 1. Find the mean: $\bar{x} = \frac{\sum x}{n} = \frac{66.27}{5} \approx 13.25$ Mbps
Step 2. Subtract the mean from each value as shown in the 2nd column.
Step 3. Square the deviations from the mean as shown in the 3rd column.
x        $x - \bar{x}$                   $(x - \bar{x})^2$
9.67     (9.67 – 13.25) = -3.58      12.8164
20.92    (20.92 – 13.25) = 7.67      58.8289
7.11     (7.11 – 13.25) = -6.14      37.6996
18.93    (18.93 – 13.25) = 5.68      32.2624
9.64     (9.64 – 13.25) = -3.61      13.0321
Step 4. Add the squared deviations: $\sum (x - \bar{x})^2$ = 154.6394
Step 5. Divide the sum obtained in step 4 by n – 1 to obtain the sample variance:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{154.6394}{4}$ = 38.65985
Step 6. Take the square root of the variance: $s = \sqrt{38.65985} \approx 6.22$ Mbps
The ISPs’ download speeds differ from the mean by about 6.22 Mbps on average. So, if you open the
Facebook App on different phones subscribed to the mentioned ISPs, the amount of time
it takes to load the feed would be different. Some will load faster or slower than the
others depending on the ISP’s download speed.
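Examples 12 and 13 can be verified with a few lines of Python. The `sample_std` helper is an illustrative implementation of the steps listed above; `statistics.stdev` from the standard library computes the same sample standard deviation.

```python
import math
import statistics

speeds = [9.67, 20.92, 7.11, 18.93, 9.64]  # ISP download speeds, in Mbps


def sample_std(values):
    """Square root of the sample variance (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)


speed_sd = sample_std(speeds)
print(round(speed_sd, 2))                   # Example 13
print(round(statistics.stdev(speeds), 2))   # same result from the standard library

# Example 12: the standard deviation is the square root of the variance.
mmff_sd = math.sqrt(29850.8)
print(round(mmff_sd, 2))
```

The code keeps the unrounded mean throughout, so intermediate values differ slightly from the hand computation, which rounds the mean to 13.25 first; the final answer agrees to two decimal places.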
In general, a large standard deviation shows that the data values are far from the mean, and a small
standard deviation indicates that the data values are clustered around the mean. However, standard
deviation might be difficult to interpret in terms of how big it has to be in order to consider the data
widely spread.
To make sense of the value of the standard deviation, there are two approaches for interpreting it:
the empirical rule and Chebyshev’s theorem.
Example 14. A class of 200 students took an exam. The scores had a sample mean $\bar{x}$ = 65 and a sample
standard deviation s = 10. The distribution is normal. Find: a) the interval that is likely
to contain approximately 68% of the scores, and b) the approximate percentage of
the scores that were between 45 and 85.
Solution:
a. Using the Empirical Rule, approximately 68% of the data set will be between $\bar{x}$ – s and
$\bar{x}$ + s.
$\bar{x}$ – s = 65 – 10 = 55 $\bar{x}$ + s = 65 + 10 = 75
It is likely that approximately 68% of the scores were between 55 and 75.
b. Since 45 = 65 – 2(10) and 85 = 65 + 2(10), the interval runs from $\bar{x}$ – 2s to $\bar{x}$ + 2s.
By the Empirical Rule, approximately 95% of the scores were between 45 and 85.
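The intervals in Example 14 can be tabulated with a small Python sketch. It assumes the usual statement of the Empirical Rule for bell-shaped data (about 68%, 95%, and 99.7% of values within 1, 2, and 3 standard deviations of the mean); the names used are illustrative.

```python
# Approximate coverage for bell-shaped data (Empirical Rule), in percent.
EMPIRICAL_COVERAGE = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_interval(mean, sd, k):
    """Interval from mean - k*sd to mean + k*sd."""
    return (mean - k * sd, mean + k * sd)


mean_score, sd_score = 65, 10  # Example 14

for k, percent in EMPIRICAL_COVERAGE.items():
    low, high = empirical_interval(mean_score, sd_score, k)
    print(f"about {percent}% of scores between {low} and {high}")
```

Reading the table backwards answers part b: the interval 45 to 85 matches k = 2, so about 95% of the scores fall inside it.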
The empirical rule applies only to data sets with bell-shaped distributions. It gives an approximation
to the proportion of data that will be within one or two standard deviations of the mean. To interpret
standard deviation for any data set, use Chebyshev’s theorem.
Chebyshev’s theorem states that the proportion of any set of data lying within K standard
deviations of the mean is always at least $1 - \frac{1}{K^2}$, where K is any number greater than 1.
For K = 2 and K = 3, we get the following statements:
a) at least $\frac{3}{4}$ (or 75%) of all values lie within two standard deviations of the mean; and
b) at least $\frac{8}{9}$ (or about 89%) of all values lie within three standard deviations of the mean.
Example 15. As part of a study on IQ as a predictor of job performance, the data show that IQ scores
have a mean of 100 and a standard deviation of 15. What information does
Chebyshev’s inequality provide about these data?
Solution: For K = 2, 100 ± 2(15) gives 70 to 130; for K = 3, 100 ± 3(15) gives 55 to 145.
Conclude: At least 75% of the respondents had IQ scores between 70 and 130.
At least about 89% of the respondents had IQ scores between 55 and 145.
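Chebyshev's bound and the intervals in Example 15 can be computed with a brief Python sketch. `chebyshev_bound` is an illustrative helper name.

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


iq_mean, iq_sd = 100, 15  # Example 15

for k in (2, 3):
    low, high = iq_mean - k * iq_sd, iq_mean + k * iq_sd
    percent = chebyshev_bound(k) * 100
    print(f"at least {percent:.1f}% of IQ scores lie between {low} and {high}")
```

Unlike the Empirical Rule, these percentages are guaranteed lower bounds for any distribution, which is why they are smaller than the bell-shaped figures for the same number of standard deviations.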
When comparing variation in samples with approximately the same mean, it is a good practice to
compare the two standard deviations. However, when comparing variation in samples with different
means or with different units, it is better to use coefficient of variation.
Coefficient of Variation
The coefficient of variation (CV) tells how large the standard deviation is relative to the mean. It can
be used to compare the spread of data sets whose values have different units.
Definition: The coefficient of variation (CV) of a data set describes the standard deviation as a
percent of the mean.
Sample: $CV = \frac{s}{\bar{x}} \cdot 100$ Population: $CV = \frac{\sigma}{\mu} \cdot 100$
Based on the definition, the numerator and denominator have the same units, so CV itself has no
units. Thus, you can directly compare variability of two different populations or samples.
Example 16. In ABC Auto Shop, the mean of the number of sales of cars over a 3-month period is 87,
and the standard deviation is 5. The mean of the sales commissions is ₱50225 and the
standard deviation is ₱7730. Compare the variations between car sales and sales
commissions.
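Solution: For car sales, $CV = \frac{5}{87} \cdot 100 \approx 5.75\%$; for sales commissions,
$CV = \frac{7730}{50225} \cdot 100 \approx 15.39\%$.
Since the CV is larger for commissions, the commissions are more variable than the
sales.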
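The comparison in Example 16 can be reproduced with a short Python sketch. `coefficient_of_variation` is an illustrative helper name.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percent of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


cv_sales = coefficient_of_variation(5, 87)              # number of cars sold
cv_commission = coefficient_of_variation(7730, 50225)   # commissions, in pesos

print(round(cv_sales, 2), round(cv_commission, 2))
print("commissions vary more" if cv_commission > cv_sales else "sales vary more")
```

Comparing the raw standard deviations (5 cars versus ₱7730) would be meaningless because the units differ; the CV removes the units, so the two percentages can be compared directly.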
Note: Standard deviation is usually preferred over variance because it is directly interpretable.
However, coefficient of variation is useful when comparing variation in population or
samples with different means or with different units.
STATISTech
A. Using MS Excel
Open an Excel worksheet and enter the label of the data in A1. Starting at A2, enter the data down to An.
In a blank cell, key in . . .
B. Using SPSS
People often compare themselves with others, whether in test scores, height, weight, or daily
allowance. When you score 30 out of 50 in an exam, you also want to know how your score compares
with the scores of your classmates. Sometimes you need to know the position of one observation relative
to others in a data set.
There are three ways to locate the relative position of a data value in a data set: z-scores,
percentiles, and quartiles.
Suppose you also want to know how your score of 30 compared to the scores of your classmates on a
50-item exam. The mean and the standard deviation of the scores can be used to compute the z-score,
which measures the distance between a particular score and the mean, measured in units of standard
deviation.
The z-score tells how many standard deviations a data value is above or below the mean in a data set.
The z-score can be positive, negative, or zero. When z is positive, the corresponding x-value is greater
than the mean. When z is negative, the corresponding x-value is less than the mean. For z = 0, the
corresponding data value is the same as the mean.
Example 17. In a Statistics class, Jen scored 30 pts in a 50-pt examination. She wants to know how her
score of 30 compares to the scores of her classmates. The mean and the standard
deviation of the exam scores are 25 and 4, respectively. Calculate Jen’s z-score and
interpret the result.
Solution: $z = \frac{x - \bar{x}}{s} = \frac{30 - 25}{4} = 1.25$
Jen’s score is 1.25 standard deviations higher than the mean (30 = 25 + 1.25s).
Example 18. The mean speed of cars along a stretch of highway is 56 kph with a standard deviation
of 4 kph. The MMDA personnel measure the speed of three cars travelling along this
stretch of highway on a particular day as 62 kph, 47 kph, and 56 kph. Find the z-score
that corresponds to each speed and interpret the result.
Solution: $z = \frac{x - \bar{x}}{s} = \frac{62 - 56}{4} = 1.5$; $z = \frac{47 - 56}{4} = -2.25$;
$z = \frac{56 - 56}{4} = 0$
The speed of 62 kph is 1.5 standard deviations above the mean; the speed of 47 kph is
2.25 standard deviations below the mean; and the speed of 56 kph is equal to the
mean. The car travelling at 47 kph is travelling unusually slowly, because its speed
corresponds to z = -2.25, more than two standard deviations below the mean.
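The z-scores in Examples 17 and 18 can be checked with a small Python sketch. `z_score` is an illustrative helper name.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / sd


# Example 17: Jen's exam score.
jen_z = z_score(30, 25, 4)
print(jen_z)

# Example 18: speeds of three cars, in kph.
car_z = {speed: z_score(speed, 56, 4) for speed in (62, 47, 56)}
for speed, z in car_z.items():
    position = "above" if z > 0 else "below" if z < 0 else "equal to"
    print(f"{speed} kph: z = {z} ({position} the mean)")
```

The sign of z gives the direction relative to the mean and its size gives the distance in standard deviations, so z-scores from different data sets can be compared on the same scale.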
A z-score is a measure of position because it describes the location of a data value relative to the
mean. Using the Empirical Rule, you can easily interpret z-scores for a data set that is normal
(bell-shaped). For skewed distributions, z-scores are difficult to interpret, so it is better to use other
measures of position.
B. Quartiles
When analyzing a data set, it is sometimes helpful to divide the set into sub-groups.
C. Percentiles
C. Quartiles
Try These:
1. The following are Jay’s grades last semester. Compute his GWA if: a. PE grade is included in
the computation. b. PE grade is not included in the computation.
2.
The following videos on YouTube and other online sources will supplement your learning on
describing data.
b. Measures of Variability
https://numberbender.com/lessons/view/1198/2.1-Measure-of-Center-and-Spread
c. Measures Of Position
https://www.youtube.com/watch?v=CiZCtar7iI8
Online (Synchronous)
Zoom, Edmodo, Facebook Messenger
Remote (Asynchronous)
Module 3, Problems Sets, PowerPoint Lessons, Consultations through GC
Assessment Task
Problem Set # 3
1. Consider the following data obtained from a random sample of 50 credit card accounts. Identify
all appropriate measures of central tendency that can be used to summarize the data.
a. outstanding balance on each account
b. type of credit card (e.g., MasterCard, Visa, American Express, etc.)
c. amount due on next payment