
CHAPTER 2

DESCRIPTIVE DATA
SQQS1013 ELEMENTARY
STATISTICS
ORGANIZING AND
VISUALIZING DATA
Objectives
In this chapter you learn about:
• Organizing categorical variables.
• Organizing numerical variables.
• Visualizing categorical variables.
• Visualizing numerical variables.
• Organizing and visualizing a mix of variables.
• The challenge in organizing and visualizing variables.
2.1 INTRODUCTION
Example
Here is a list of questions asked in a large statistics class and the data value
given by one of the students:
i. What is your sex (m=male, f=female)? m
ii. How many hours did you sleep last night? 5 hours
iii. Randomly pick a letter, S or Q. S
iv. What is your height in inches? 67 inches
v. What’s the fastest you’ve ever driven a car (mph)? 110 mph
Raw data - Data recorded in the sequence in which
they were originally collected, before being
processed or ranked.

Array data - Raw data that are arranged in ascending or descending order.
PRESENTATION OF
DATA
• Organizing data creates both tabular and visual summaries.

• Summaries guide further exploration and sometimes facilitate decision-making.

• Visual summaries enable rapid review of larger amounts of data and show possible significant patterns.

• Often, the Organize and Visualize steps in DCOVA (Define, Collect, Organize, Visualize, Analyze) occur concurrently.


2.2 PRESENTATION
OF
QUALITATIVE
DATA
2.2.1 Organizing Categorical Data
Categorical Data: Tallying Data
One Categorical Variable → Summary Table
Two Categorical Variables → Contingency Table
• Summary Table tallies the frequencies or percentages of items in a set of
categories so that you can see differences between categories.

Table 2.3 Main Reason Young Adults Shop Online


Reason of Shopping Online? Frequency Percentage
Better Prices 555 37%
Avoiding holiday crowds or hassles 435 29%
Convenience 270 18%
Better selection 195 13%
Ships directly 45 3%
Total 1500 100%
Source: Data extracted and adapted from “Main Reason Young Adults Shop Online?”
USA Today, December 5, 2012, p. 1A.
Percentage = (Frequency / Total) × 100%; for example, (555 / 1500) × 100% = 37%
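The percentage column of Table 2.3 can be reproduced with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the course material.

```python
# Frequencies from Table 2.3 (main reason young adults shop online).
frequencies = {
    "Better prices": 555,
    "Avoiding holiday crowds or hassles": 435,
    "Convenience": 270,
    "Better selection": 195,
    "Ships directly": 45,
}

total = sum(frequencies.values())

# Percentage of each category = frequency / total * 100.
percentages = {reason: 100 * f / total for reason, f in frequencies.items()}

for reason, pct in percentages.items():
    print(f"{reason:<36}{frequencies[reason]:>6}{pct:>7.0f}%")
print(f"{'Total':<36}{total:>6}{sum(percentages.values()):>7.0f}%")
```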
• Contingency Table

– Helps Organize Two or More Categorical Variables

– Used to study patterns that may exist between the responses of two or more categorical
variables.

– Cross tabulates or tallies jointly the responses of the categorical variables.

– For two variables the tallies for one variable are located in the rows and the tallies for the
second variable are located in the columns
Example 2.1: Contingency Table
• A random sample of 400 invoices is drawn.
• Each invoice is categorized as a small, medium, or large amount.
• Each invoice is also examined to identify if there are any errors.
• These data are then organized in the contingency table below.

Table 2.4 Contingency Table Showing the Frequency of Invoices Categorized by Size and the Presence of Errors

                 No Errors   Errors   Total
Small Amount        170        20      190
Medium Amount       100        40      140
Large Amount         65         5       70
Total               335        65      400
Contingency Table Based on Percentage of Overall Total

                 No Errors   Errors    Total
Small Amount      42.50%      5.00%    47.50%
Medium Amount     25.00%     10.00%    35.00%
Large Amount      16.25%      1.25%    17.50%
Total             83.75%     16.25%   100.0%

Each cell is divided by the overall total: 42.50% = 170 / 400, 25.00% = 100 / 400, 16.25% = 65 / 400.

83.75% of sampled invoices have no errors and 47.50% of sampled invoices are for small amounts.
Contingency Table Based on Percentage of Row Totals

                 No Errors   Errors    Total
Small Amount      89.47%     10.53%   100.0%
Medium Amount     71.43%     28.57%   100.0%
Large Amount      92.86%      7.14%   100.0%
Total             83.75%     16.25%   100.0%

Each cell is divided by its row total: 89.47% = 170 / 190, 71.43% = 100 / 140, 92.86% = 65 / 70.

Medium invoices have a larger chance (28.57%) of having errors than small (10.53%) and large (7.14%) invoices.


Contingency Table Based on Percentage of Column Totals

                 No Errors   Errors    Total
Small Amount      50.75%     30.77%    47.50%
Medium Amount     29.85%     61.54%    35.00%
Large Amount      19.40%      7.69%    17.50%
Total            100.0%     100.0%    100.0%

Each cell is divided by its column total: 50.75% = 170 / 335, 30.77% = 20 / 65.

There is a 61.54% chance that invoices with errors are of medium size.
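The three percentage views of Table 2.4 differ only in the denominator. The following sketch, with illustrative names that are not part of the original slides, computes all three from the counts.

```python
# Counts from Table 2.4 (invoice size by presence of errors).
counts = {
    "Small": {"No Errors": 170, "Errors": 20},
    "Medium": {"No Errors": 100, "Errors": 40},
    "Large": {"No Errors": 65, "Errors": 5},
}
columns = ["No Errors", "Errors"]

row_totals = {size: sum(row.values()) for size, row in counts.items()}
col_totals = {c: sum(counts[s][c] for s in counts) for c in columns}
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Same counts, three different denominators.
overall = {s: {c: pct(counts[s][c], grand_total) for c in columns} for s in counts}
by_row = {s: {c: pct(counts[s][c], row_totals[s]) for c in columns} for s in counts}
by_col = {s: {c: pct(counts[s][c], col_totals[c]) for c in columns} for s in counts}

print("Overall:", overall["Small"])
print("By row: ", by_row["Medium"])
print("By col: ", by_col["Medium"])
```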
2.2.2 Visualizing Categorical Data
Categorical Data: Visualizing Data
Summary Table for One Variable → Bar Chart, Pareto Chart, Pie or Doughnut Chart
Contingency Table for Two Variables → Multiple Bar Chart, Component Chart, Doughnut Chart
The Bar Chart
The bar chart visualizes a categorical variable as a series of bars. The length of
each bar represents either the frequency or percentage of values for each
category. Each bar is separated by a space called a gap.

Reason for Shopping Online?              Percent
Better prices                              37%
Avoiding holiday crowds or hassles         29%
Convenience                                18%
Better selection                           13%
Ships directly                              3%
The Pie Chart
The pie chart is a circle broken up into slices that represent categories. The size
of each slice of the pie varies according to the percentage in each category.

Reason for Shopping Online?              Percent
Better prices                              37%
Avoiding holiday crowds or hassles         29%
Convenience                                18%
Better selection                           13%
Ships directly                              3%
The Doughnut Chart
▪ The doughnut chart is the outer part of a circle broken up into pieces that represent categories. The size of each piece of the doughnut varies according to the percentage in each category.

Doughnut Chart of Reasons to Shop Online

Reason for Shopping Online?              Percent
Better prices                              37%
Avoiding holiday crowds or hassles         29%
Convenience                                18%
Better selection                           13%
Ships directly                              3%
The Pareto Chart
• Used to portray categorical data (nominal scale).

• A vertical bar chart, where categories are shown in descending order of frequency.

• A cumulative polygon is shown in the same graph.

• Used to separate the “vital few” from the “trivial many.”


The Pareto Chart (cont’d)
Table 2.5 Ordered Summary Table for Causes of Incomplete ATM Transactions

Cause                        Frequency   Percent   Cumulative Percent
Warped card jammed              365       50.41%        50.41%
Card unreadable                 234       32.32%        82.73%
ATM malfunctions                 32        4.42%        87.15%
ATM out of cash                  28        3.87%        91.02%
Invalid amount requested         23        3.18%        94.20%
Wrong keystroke                  23        3.18%        97.38%
Lack of funds in account         19        2.62%       100.00%
Total                           724      100.00%

Source: Data extracted from A. Bhalla, “Don’t Misuse the Pareto Principle,” Six Sigma Forum Magazine, May 2009, pp. 15–18.
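The ordering and cumulative percentages behind a Pareto chart can be computed as in the sketch below. The code is illustrative only: it sorts the causes by frequency and accumulates the percentages, which is the table a Pareto chart is drawn from.

```python
# Frequencies from Table 2.5 (causes of incomplete ATM transactions).
causes = {
    "Warped card jammed": 365,
    "Card unreadable": 234,
    "ATM malfunctions": 32,
    "ATM out of cash": 28,
    "Invalid amount requested": 23,
    "Wrong keystroke": 23,
    "Lack of funds in account": 19,
}

total = sum(causes.values())

# A Pareto chart lists categories in descending order of frequency.
ordered = sorted(causes.items(), key=lambda item: item[1], reverse=True)

table = []
running = 0
for cause, freq in ordered:
    running += freq
    # (cause, frequency, percent, cumulative percent)
    table.append((cause, freq, 100 * freq / total, 100 * running / total))

for cause, freq, pct, cum_pct in table:
    print(f"{cause:<26}{freq:>5}{pct:>9.2f}%{cum_pct:>9.2f}%")
```

The first two rows already account for more than 80% of the incomplete transactions, which is what the chart means by the “vital few”.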
The Pareto Chart (cont’d)
[Figure: Pareto chart of the causes of incomplete ATM transactions, with the “vital few” causes highlighted.]
Multiple (Side By Side) Bar Charts
▪ The side by side bar chart represents the data from a contingency table.

                 No Errors   Errors    Total
Small Amount      50.75%     30.77%    47.50%
Medium Amount     29.85%     61.54%    35.00%
Large Amount      19.40%      7.69%    17.50%
Total            100.0%     100.0%    100.0%

Invoices with errors are much more likely to be of medium size (61.5% vs 30.8% & 7.7%).
Component Bar Charts
▪ The component bar chart represents the data from a contingency table.

                 No Errors   Errors    Total
Small Amount      50.75%     30.77%    47.50%
Medium Amount     29.85%     61.54%    35.00%
Large Amount      19.40%      7.69%    17.50%
Total            100.0%     100.0%    100.0%

Invoices with errors are much more likely to be of medium size (61.5% vs 30.8% & 7.7%).
Doughnut Charts
▪ A doughnut chart can be used to represent the data from a contingency table.

                 No Errors   Errors    Total
Small Amount      50.75%     30.77%    47.50%
Medium Amount     29.85%     61.54%    35.00%
Large Amount      19.40%      7.69%    17.50%
Total            100.0%     100.0%    100.0%

Invoices with errors are much more likely to be of medium size (61.5% vs 30.8% & 7.7%).
EXERCISE 2.1
A recent consumer survey on holiday shopping reveals the following information on the types of stores at which consumers plan to shop.

Types of Stores                              % of Customers
Stand-alone “big box” stores                       54
Traditional mall                                   61
Local independent stores not in a mall             35
Strip mall or mini mall                            25
Town hall mall                                     14
I do not plan to shop at any of these               9

i. Construct a bar chart for the types of stores customers plan to shop at.
ii. Construct a pie chart for the types of stores customers plan to shop at.
iii. At which type of store do the most customers plan to shop?
iv. What percentage do the top 2 categories of stores make up out of the 6 categories of shopping preferences?
v. What percentage of the customers surveyed mentioned that they did not plan to shop at any of these stores?
SOLUTION EXERCISE 2.1
i. [Figure: bar chart of the types of stores.]
ii. [Figure: pie chart of the types of stores.]
iii. Traditional mall
iv. [(61 + 54) / 198] × 100% = 58%
v. 9%
2.3 PRESENTATION
OF
QUANTITATIVE
DATA
2.3.1 Organizing Quantitative Data

Numerical Data

Ordered Array | Frequency Distributions | Cumulative Distributions
Ordered Array
▪ An ordered array is a sequence of data, in rank order, from the smallest value to the
largest value.

▪ Shows range (minimum value to maximum value).

▪ May help identify outliers (unusual observations).

Age of Surveyed College Students

Day Students
16 17 17 18 18 18
19 19 20 20 21 22
22 25 27 32 38 42

Night Students
18 18 19 19 20 21
23 28 32 33 41 45
Frequency Distribution
▪ The frequency distribution is a summary table in which the data are arranged into numerically ordered classes (grouped data).
▪ You must give attention to:
i. Selecting the appropriate number of classes, c, for the table using Sturges' Rule:
c = 1 + 3.3 log n
where n is the number of observations. c may be rounded up or rounded down.

ii. Determining a suitable class width, i, and establishing the boundaries of each class to avoid overlapping. i must always be rounded up.

iii. Choosing the starting point of the 1st class: use the smallest value in the data set.
Example 2.2
Frequency Distribution Example
A manufacturer of insulation randomly selects 20 winter
days and records the daily high temperature.

24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
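Solution Example 2.2
• Number of classes: c = 1 + 3.3 log 20 = 5.29 ≈ 5. Class width: i = (58 − 12) / 5 = 9.2, rounded up to 10. The first class starts at the smallest value, 12.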
Solution Example 2.2 (con’t)
Data in Ordered Array
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class Frequency

12 - 21 4
22 - 31 6
32 - 41 5
42 - 51 3
52 - 61 2
Total 20
Solution Example 2.2 (con’t)

Class      Frequency   Relative Frequency   Percentage
12 - 21        4             0.20              20%
22 - 31        6             0.30              30%
32 - 41        5             0.25              25%
42 - 51        3             0.15              15%
52 - 61        2             0.10              10%
Total         20             1.00             100%

Relative Frequency = Frequency / Total; for example, 2 / 20 = 0.1


Solution Example 2.2 (con’t)

Class      Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
12 - 21        4          20%                4                     20%
22 - 31        6          30%               10                     50%
32 - 41        5          25%               15                     75%
42 - 51        3          15%               18                     90%
52 - 61        2          10%               20                    100%
Total         20         100%

Cumulative Percentage = (Cumulative Frequency / Total) × 100; for example, (10 / 20) × 100 = 50%
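The whole of Example 2.2 can be traced in code. The sketch below assumes the class width is taken as the range divided by the number of classes and rounded up, and that the first class starts at the smallest value; the variable names are illustrative.

```python
import math

# Daily high temperatures from Example 2.2.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
n = len(temps)

# Sturges' Rule: c = 1 + 3.3 log n, rounded to a whole number of classes.
num_classes = round(1 + 3.3 * math.log10(n))

# Assumed class width: range / c, always rounded up.
width = math.ceil((max(temps) - min(temps)) / num_classes)

start = min(temps)  # the first class starts at the smallest value

rows = []
cumulative = 0
for k in range(num_classes):
    lower = start + k * width
    upper = lower + width - 1
    freq = sum(lower <= t <= upper for t in temps)
    cumulative += freq
    # (lower, upper, frequency, relative frequency, cum. frequency, cum. %)
    rows.append((lower, upper, freq, freq / n, cumulative, 100 * cumulative / n))

for lower, upper, freq, rel, cum, cum_pct in rows:
    print(f"{lower} - {upper}: f={freq}, rel={rel:.2f}, cum={cum}, cum%={cum_pct:.0f}%")
```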
Why Use a Frequency Distribution?
• It condenses the raw data into a more useful form.
• It allows for a quick visual interpretation of the data.
• It enables the determination of the major characteristics of the data set including where the
data are concentrated/clustered.

Frequency Distribution: Tips


• As the size of the data set increases, the impact of alterations in the selection of class
boundaries is greatly reduced.
• When comparing two or more groups with different sample sizes, you must use either a
relative frequency or a percentage distribution.
2.3.2 Visualizing Numerical Data
Numerical Data
Ordered Array → Stem-and-Leaf Display
Frequency Distributions and Cumulative Distributions → Histogram, Polygon, Ogive
Stem-and-Leaf Display
A simple way to see how the data are distributed and where concentrations of
data exist.

METHOD: Separate the sorted data series into leading digits (the stems) and the trailing digits (the leaves).
Stem and Leaf Display
A stem-and-leaf display organizes data into groups (called stems) so that the
values within each group (the leaves) branch out to the right on each row.

Age of Surveyed College Students

Day Students
16 17 17 18 18 18
19 19 20 20 21 22
22 25 27 32 38 42

Stem | Leaf
  1  | 6 7 7 8 8 8 9 9
  2  | 0 0 1 2 2 5 7
  3  | 2 8
  4  | 2

Night Students
18 18 19 19 20 21
23 28 32 33 41 45

Stem | Leaf
  1  | 8 8 9 9
  2  | 0 1 3 8
  3  | 2 3
  4  | 1 5
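A stem-and-leaf display for two-digit data can be built by splitting each value into its tens digit and units digit. The helper below is a hypothetical sketch, not a library function, and it skips stems that have no leaves.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group sorted values by leading digit (stem); keep trailing digits (leaves)."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    return dict(groups)


day_students = [16, 17, 17, 18, 18, 18, 19, 19, 20,
                20, 21, 22, 22, 25, 27, 32, 38, 42]

display = stem_and_leaf(day_students)
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```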
The Histogram
▪ A vertical bar chart of the data in a frequency distribution is called a histogram.

▪ In a histogram there are no gaps between adjacent bars.

▪ The class boundaries (or class midpoints) are shown on the horizontal axis.

▪ The vertical axis is either frequency, relative frequency, or percentage.

▪ The height of each bar represents the frequency, relative frequency, or percentage.
The Histogram
Class      Frequency   Relative Frequency   Percentage
12 - 21        3             0.15               15
22 - 31        6             0.30               30
32 - 41        5             0.25               25
42 - 51        4             0.20               20
52 - 61        2             0.10               10
Total         20             1.00              100

(In a percentage histogram the vertical axis would be defined to show the percentage of observations per class.)
The Polygon
▪ A percentage polygon is formed by having the midpoint of each class
represent the data in that class and then connecting the sequence of
midpoints at their respective class percentages.

▪ The cumulative percentage polygon, or ogive, displays the variable of interest along the X-axis, and the cumulative percentages along the Y-axis.

▪ Useful when there are two or more groups to compare.


The Frequency Polygon
Useful When Comparing Two or More Groups
The Percentage Polygon
Ogive
• An ogive is a curve drawn for the cumulative frequency distribution.
• Two types of ogive:
(1) “less than” ogive
(2) “greater than” ogive
• Steps:
– Build a table of cumulative frequencies.
– Draw the x and y axes. Label x = class boundaries, y = cumulative frequencies.
– Plot each cumulative frequency against the appropriate class boundary.
– Join the points consecutively, starting from the first appropriate class boundary.
2.3.3 Visualizing Two Numerical Variables

Two Numerical Variables: Scatter Plot, Time-Series Plot
The Scatter Plot
▪ Scatter plots are used for numerical data consisting of paired observations
taken from two numerical variables.

▪ One variable is measured on the vertical axis and the other variable is
measured on the horizontal axis.

▪ Scatter plots are used to examine possible relationships between two numerical variables.
Scatter Plot Example
Volume per day   Cost per day
23 125
26 140
29 146
33 160
38 167
42 170
50 188
55 195
60 200
The Time-Series Plot
• A time-series plot is used to study patterns in the values of a numeric variable over time.

• In a time-series plot, the numeric variable is measured on the vertical axis and the time period is measured on the horizontal axis.
Time Series Plot Example
Year   Number of Franchises
2007 43
2008 54
2009 60
2010 73
2011 82
2012 95
2013 107
2014 99
2015 95
EXERCISE 2.2
The histogram below represents scores achieved by 200 job applicants on a personality profile.
i. What percentage of the job applicants scored between 10 and 20? 20%
ii. What percentage of the job applicants scored below 50? 80%
iii. How many job applicants scored from 30 to below 60? 80
iv. How many job applicants scored 50 or above? 40
v. 90% of the job applicants scored above or equal to 10.
vi. Half of the job applicants scored below 30.
NUMERICAL
DESCRIPTIVE
MEASURE
Objectives
In this topic, you learn to:

• Describe the properties of central tendency, variation, and shape in numerical variables.

• Construct and interpret a boxplot.

• Compute descriptive summary measures for a population.


Summary
▪ The central tendency is the extent to which the values of a numerical variable
group around a typical or central value.

▪ The variation is the amount of dispersion or scattering away from a central value that the values of a numerical variable show.

▪ The shape is the pattern of the distribution of values from the lowest value to
the highest value.
2.4 MEASURE OF
CENTRAL
TENDENCY
2.4.1 MEAN
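2.4.1.1 UNGROUPED DATA
• For a sample of size n:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

where $\bar{x}$ (pronounced x-bar) is the sample mean, $x_i$ is the ith observed value, and $n$ is the sample size.

EXAMPLE 2.3

[Figure: two dot plots on an axis from 11 to 20; the first has Mean = 13 and the second has Mean = 14.]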
2.4.1 MEAN
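2.4.1.2 GROUPED DATA

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

where $\bar{x}$ (pronounced x-bar) is the mean, $x_i$ is the midpoint of the ith class, $f_i$ is the frequency of the ith class, and $\sum f_i = n$ is the total frequency (sample size).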
EXAMPLE 2.3
a. During a semester, a student took five exams. The population of exam
scores is 78, 83, 92, 68, and 85. Find the mean. (406, 81.2)

b. The following table shows the speeds (in km/h) of 30 cars measured at a certain checkpoint. Find the mean. (1504, 50.13)

41 53 58 67 33 61 43 45 42 67
39 48 36 47 34 59 57 54 65 69
63 42 60 48 66 30 30 46 52 49
c. The following table presents the daily high temperature recorded by a manufacturer of insulation for 20 randomly selected winter days (refer to Example 2.2). Approximate the mean of daily high temperature. (34.5)

Class      Frequency (f)   Midpoint (x)    f·x
12 - 21         3              16.5        49.5
22 - 31         6              26.5       159.0
32 - 41         5              36.5       182.5
42 - 51         4              46.5       186.0
52 - 61         2              56.5       113.0
Total          20                         690.0
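Parts (a) and (c) of the example can be checked with the short sketch below. For grouped data the class midpoint stands in for every value in the class, so the result is an approximation; the names are illustrative.

```python
# (a) Ungrouped data: mean = sum of values / number of values.
scores = [78, 83, 92, 68, 85]
mean_scores = sum(scores) / len(scores)

# (c) Grouped data: mean = sum(f * midpoint) / sum(f).
# Each entry is ((lower limit, upper limit), frequency), as tabulated above.
classes = [((12, 21), 3), ((22, 31), 6), ((32, 41), 5), ((42, 51), 4), ((52, 61), 2)]

n = sum(f for _, f in classes)
sum_fx = sum(f * (lower + upper) / 2 for (lower, upper), f in classes)
grouped_mean = sum_fx / n

print(sum(scores), mean_scores)   # total and mean of the exam scores
print(sum_fx, grouped_mean)       # sum of f*x and approximate mean
```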
2.4.2 MEDIAN
2.4.2.1 UNGROUPED DATA
• In an ordered array, the median is the “middle” number (50% above, 50% below).

[Figure: two dot plots on an axis from 11 to 20; both have Median = 13.]

• Less sensitive to extreme values compared to mean.


Procedure of Computing Median
1. Arrange data in ascending order.

2. Locate the median when the values are in numerical order (smallest to largest):
– If the number of values is odd, the median is the middle number.
– If the number of values is even, the median is the average of the two middle numbers.

3. Find the median.


2.4.2 MEDIAN
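2.4.2.2 GROUPED DATA

$$\text{Median} = L_m + \left(\frac{\frac{n}{2} - F}{f_m}\right) i$$

where $L_m$ is the lower boundary of the median class, $n$ is the total frequency, $F$ is the cumulative frequency before the median class, $f_m$ is the frequency of the median class, and $i$ is the class width.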
EXAMPLE 2.4
a. During a semester, a student took five exams. The population of exam scores
is 78, 83, 92, 68, and 85. Find the median. (83)

b. One of the goals of medical research is to develop treatments that reduce the
time spent in recovery. Eight patients undergo a new surgical procedure, and
the number of days spent in recovery for each is as follows. Find the
median. (17)
c. The following table presents the daily high temperature recorded by a manufacturer of insulation for 20 randomly selected winter days (refer to Example 2.2). Approximate the median of daily high temperature. (33.5)

Class      Frequency   Cumulative Frequency
12 - 21        3                3
22 - 31        6                9
32 - 41        5               14     ← Median class (n/2 = 20/2 = 10)
42 - 51        4               18
52 - 61        2               20
Total         20
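The grouped median in part (c) can be sketched as follows. The function is hypothetical and assumes integer class limits, so that each lower boundary is the lower limit minus 0.5 and the class width is upper limit minus lower limit plus 1.

```python
def grouped_median(classes):
    """Approximate the median from (lower limit, upper limit, frequency) rows.

    Rows must be in ascending order. Assumes integer class limits, so the
    lower class boundary is lower - 0.5 and the width is upper - lower + 1.
    """
    n = sum(f for _, _, f in classes)
    target = n / 2
    cumulative = 0  # cumulative frequency before the current class
    for lower, upper, f in classes:
        if cumulative + f >= target:
            boundary = lower - 0.5
            width = upper - lower + 1
            return boundary + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("classes must contain at least one observation")


temperature_classes = [(12, 21, 3), (22, 31, 6), (32, 41, 5), (42, 51, 4), (52, 61, 2)]
median = grouped_median(temperature_classes)
print(median)
```

Here the median class is 32 - 41, so the calculation is 31.5 + ((10 − 9) / 5) × 10.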
2.4.3 MODE
2.4.3.1 UNGROUPED DATA
• Value that occurs most often.
• Not affected by extreme values.
• Used for either numerical or categorical data.
• There may be no mode.
• There may be several modes.

[Figure: two dot plots; the first, on an axis from 0 to 14, has Mode = 9, and the second, on an axis from 0 to 6, has no mode.]
2.4.3 MODE
2.4.3.2 GROUPED DATA
• Determine the class mode (or modal class): the class with the highest frequency.
• Use the following formula:

$$\text{Mode} = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) i$$

where $L_{mo}$ is the lower boundary of the class mode, $\Delta_1$ is the difference between the frequency of the class mode and the frequency of the class before it, $\Delta_2$ is the difference between the frequency of the class mode and the frequency of the class after it, and $i$ is the class width.
Approximating mode using histogram

[Figure: histogram of the number of text messages, with class boundaries -0.5, 49.5, 99.5, 149.5, 199.5, 249.5 and 299.5 on the horizontal axis.]

MODE = 140
EXAMPLE 2.5
a. Ten students were asked how many siblings they had. The results, arranged in order, were
0 1 1 1 1 2 2 3 3 6
Find the mode of this data set. (1)
b. The following table presents the daily high temperature in a manufacturer of
insulation for randomly selected 20 winter days(Refer Example 2.2).
Approximate the mode of daily high temperature. (29.0)

Class Frequency

12 - 21 3
22 - 31 6 Class Mode → highest freq.
32 - 41 5

42 - 51 4
52 - 61 2

Total 20
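The grouped mode in part (b) can be reproduced with the sketch below. It is a hypothetical helper that assumes integer class limits, a single modal class, and a frequency of zero beyond the first and last classes.

```python
def grouped_mode(classes):
    """Approximate the mode from (lower limit, upper limit, frequency) rows.

    Assumes integer class limits (boundary = lower - 0.5), one modal class,
    and a frequency of 0 outside the first and last classes.
    """
    freqs = [f for _, _, f in classes]
    k = freqs.index(max(freqs))                      # position of the modal class
    lower, upper, f_mode = classes[k]
    before = freqs[k - 1] if k > 0 else 0
    after = freqs[k + 1] if k < len(freqs) - 1 else 0
    d1 = f_mode - before                             # modal class vs. class before
    d2 = f_mode - after                              # modal class vs. class after
    return (lower - 0.5) + d1 / (d1 + d2) * (upper - lower + 1)


temperature_classes = [(12, 21, 3), (22, 31, 6), (32, 41, 5), (42, 51, 4), (52, 61, 2)]
mode = grouped_mode(temperature_classes)
print(mode)
```

For the temperature table the modal class is 22 - 31, so the calculation is 21.5 + (3 / (3 + 1)) × 10.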
Which Measure to Choose?
▪ The mean is generally used, unless extreme values (outliers) exist.
▪ The median is often used when outliers exist, since the median is not sensitive to extreme values. For example, median home prices may be reported for a region.
▪ In many situations it makes sense to report both the mean and the median.
Describing the Shape of a Data Set
• The mean and median measure the center of a data set in different ways. When a data set is symmetric, the mean, median and mode are equal.
• When a data set is skewed to the right, there are large values in the right tail. Because the median is resistant while the mean is not, the mean is generally more affected by these large values. Therefore, for a data set that is skewed to the right, the mean is often greater than the median, which in turn is greater than the mode.
• Similarly, when a data set is skewed to the left, the mean is often less than the median, which in turn is less than the mode.
i. Approximately Symmetric
Shape: Approximately Symmetric
Relationship Between the Mean, Median and Mode: Mean, median and mode are approximately the same.

ii. Skewed to the Right
Shape: Skewed to the Right
Relationship Between the Mean, Median and Mode: Mean is noticeably greater than the median, which is greater than the mode.

iii. Skewed to the Left
Shape: Skewed to the Left
Relationship Between the Mean, Median and Mode: Mean is noticeably less than the median, which is less than the mode.
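Summary of Measure of Central Tendency

Measure   Ungrouped data                                  Grouped data
Mean      $\bar{x} = \frac{\sum x_i}{n}$                  $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
Mode      Value with highest frequency                    $L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) i$
          (there could be more than one)
Median    Middle value of the ordered array               $L_m + \left(\frac{\frac{n}{2} - F}{f_m}\right) i$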
2.5 MEASURE OF
POSITION

Position: Percentiles, Quartiles

Measures of position are techniques that divide a set of data into equal groups. To determine a measure of position, the data must be sorted from lowest to highest. The different measures of position are percentiles and quartiles.
2.5.1 PERCENTILES
• The mean and median of a data set describe the center of a distribution
(quantitative).
• For some data it is often useful to compute measures of positions other than
the center, to get a more detailed description of the distribution.
• Percentiles provide a way to do this. Percentiles divide a data set into
hundredths.
• Definition: For a number p between 1 and 99, the pth percentile separates the
lowest p% of the data from the highest (100 – p)%.

2.5.1 PERCENTILES
UNGROUPED DATA
• First, the data need to be arranged in increasing order.

• To compute the data value corresponding to a given percentile, find the location

$$L = \frac{p}{100} \cdot n$$

where $n$ is the number of values in the data set.
– If L is a whole number, then the pth percentile is the average of the number in position L and the number in position (L+1).
– If L is not a whole number, round it up to the next higher whole number. The pth percentile is the number in the position corresponding to the rounded-up value.

• To compute the percentile corresponding to a given data value, X:

$$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{n} \times 100\%$$

– Round the result to the nearest whole number.
EXAMPLE 2.6
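A teacher gives a 20-point test to 10 students. The scores are shown here.
18 15 12 6 8 2 3 5 20 10

Ordered array: 2 3 5 6 8 10 12 15 18 20

1. Find the values corresponding to the 25th and 60th percentiles (5, 11).
For p = 25: L = (25 / 100)(10) = 2.5, which is rounded up to 3, so the 25th percentile is the 3rd value, 5. For p = 60: L = (60 / 100)(10) = 6, a whole number, so the 60th percentile is the average of the 6th and 7th values, (10 + 12) / 2 = 11.

2. Find the percentile ranks of scores of 6 and 12 (35, 65).
For X = 6: three values are below 6, so the percentile is ((3 + 0.5) / 10) × 100% = 35%. For X = 12: six values are below 12, so the percentile is ((6 + 0.5) / 10) × 100% = 65%.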
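The two percentile procedures can be sketched in Python as below. The function names are hypothetical, and the rank formula is the one that reproduces the stated answers (35 and 65): the number of values below X plus 0.5, divided by n.

```python
import math


def percentile_value(data, p):
    """Value at the pth percentile using the location rule L = (p / 100) * n."""
    ordered = sorted(data)
    location = p * len(ordered) / 100
    if location == int(location):
        k = int(location)
        # Whole number: average the values in positions L and L + 1.
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise round L up and take the value in that position.
    return ordered[math.ceil(location) - 1]


def percentile_rank(data, x):
    """Percentile of value x: (values below x + 0.5) / n * 100, rounded half up."""
    below = sum(v < x for v in data)
    return math.floor((below + 0.5) * 100 / len(data) + 0.5)


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_value(scores, 25), percentile_value(scores, 60))
print(percentile_rank(scores, 6), percentile_rank(scores, 12))
```

Statistical software often uses other interpolation rules, so library functions may return slightly different percentiles for the same data.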

   
2.5.2 QUARTILES
• There are 3 percentiles that are used more often than the others: the 25th, the 50th, and the 75th.
• These percentiles divide the data into 4 parts, each of which contains approximately one quarter of the data.
• Thus, these 3 percentiles are called quartiles.
• You can visualize the distribution of the values for a numerical variable by:
– Computing the quartiles.
– Computing the five-number summary.
– Constructing a boxplot.
2.5.2 QUARTILE MEASURES
2.5.2.1 UNGROUPED DATA
• Quartiles split the ranked data into 4 segments with an equal number of values
per segment.

25% 25% 25% 25%

Q1 Q2 Q3

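■ The first quartile, Q1, is the value for which 25% of the values are smaller and 75% are larger.
■ Q2 is the same as the median (50% of the values are smaller and 50% are larger).
■ The third quartile, Q3, separates the lowest 75% of the data from the highest 25%: only 25% of the values are greater than Q3.
• Because Q1 and Q3 are the 25th and 75th percentiles, their positions in the ordered data are n/4 and 3n/4. If the position is a whole number, average the value in that position and the next one; otherwise round the position up.
2.5.2 QUARTILE MEASURES
2.5.2.2 GROUPED DATA

$$Q_k = L_{Q_k} + \left(\frac{\frac{kn}{4} - F}{f_{Q_k}}\right) i, \qquad k = 1, 3$$

where $L_{Q_k}$ is the lower boundary of the class containing $Q_k$, $n$ is the total frequency, $F$ is the cumulative frequency before that class, $f_{Q_k}$ is the frequency of that class, and $i$ is the class width.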
EXAMPLE 2.7
• Following are final exam scores, arranged in increasing order for 28 students.

58 59 62 64 67 68 69 71 73 74 74 75 76 76
76 77 78 78 78 82 82 84 86 87 87 88 91 97

a. Find the 1st quartile of the scores (70).


b. Find the 3rd quartile of the scores (83).

EXAMPLE 2.8
The following table presents the daily high temperature recorded by a manufacturer of insulation for 20 randomly selected winter days (refer to Example 2.2). Calculate Q1 and Q3.

Class      Frequency   Cumulative Frequency
12 - 21        3                3
22 - 31        6                9     ← Class Q1 (20/4 = 5)
32 - 41        5               14
42 - 51        4               18     ← Class Q3 (3(20)/4 = 15)
52 - 61        2               20
Total         20
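The grouped quartiles for this table can be worked out with the sketch below. It assumes the grouped quartile formula has the same form as the grouped median formula, with n/4 or 3n/4 in place of n/2, and that class limits are integers; the function name is illustrative.

```python
def grouped_quantile(classes, fraction):
    """Approximate a quantile from (lower limit, upper limit, frequency) rows.

    `fraction` is 0.25 for Q1 and 0.75 for Q3. Assumes integer class limits,
    so the lower boundary is lower - 0.5 and the width is upper - lower + 1.
    """
    n = sum(f for _, _, f in classes)
    target = fraction * n
    cumulative = 0  # cumulative frequency before the current class
    for lower, upper, f in classes:
        if cumulative + f >= target:
            return (lower - 0.5) + (target - cumulative) / f * (upper - lower + 1)
        cumulative += f
    raise ValueError("classes must contain at least one observation")


temperature_classes = [(12, 21, 3), (22, 31, 6), (32, 41, 5), (42, 51, 4), (52, 61, 2)]
q1 = grouped_quantile(temperature_classes, 0.25)   # 21.5 + ((5 - 3) / 6) * 10
q3 = grouped_quantile(temperature_classes, 0.75)   # 41.5 + ((15 - 14) / 4) * 10
print(round(q1, 2), q3)
```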
 
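Conclusions: Measures of Position

Measurement    Ungrouped data                Grouped data
Percentiles    Position $L = \frac{p}{100} \cdot n$    −
1st Quartile   Position $\frac{n}{4}$        $Q_1 = L_{Q_1} + \left(\frac{\frac{n}{4} - F}{f_{Q_1}}\right) i$
3rd Quartile   Position $\frac{3n}{4}$       $Q_3 = L_{Q_3} + \left(\frac{\frac{3n}{4} - F}{f_{Q_3}}\right) i$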
2.6 MEASURE OF
DISPERSION
Variation: Range, Variance, Standard Deviation, Coefficient of Variation

■ Measures of variation give information on the spread or variability or dispersion of the data values.

[Figure: two distributions with the same center but different variation.]
2.6.1 THE RANGE
2.6.1.1 UNGROUPED DATA
▪ Simplest measure of variation.
▪ Difference between the largest and the smallest values:

Range = Xlargest – Xsmallest

Example:

[Figure: dot plot on an axis from 0 to 14, with smallest value 1 and largest value 13.]

Range = 13 - 1 = 12
2.6.1 THE RANGE
2.6.1.2 GROUPED DATA
Range = upper boundary of the last class – lower boundary of the first class

Class       Frequency
41 – 50         1
51 – 60         3
61 – 70         7
71 – 80        13
81 – 90        10
91 – 100        6
Total          40

Upper boundary of last class = 100.5
Lower boundary of first class = 40.5
Range = 100.5 – 40.5 = 60
Why The Range Can Be Misleading
▪ Does not account for how the data are distributed.

7 8 9 10 11 12 7 8 9 10 11 12
Range = 12 - 7 = 5 Range = 12 - 7 = 5

▪ Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
EXAMPLE 2.9
The following table presents the average monthly temperature, in degrees Fahrenheit, for the
cities of San Francisco and St. Louis. Compute the range for each city.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
San 51 54 55 56 58 60 60 61 63 62 58 52
Francisco
St. Louis 30 35 44 57 66 75 79 78 70 59 45 35
Source: National Weather Service

Solution:
The range for San Francisco is 63 – 51 = 12.

The range for St. Louis is 79 – 30 = 49.


2.6.2 INTERQUARTILE RANGE (IQR)
• Quartiles can be used as a rough measurement of variability.
• The interquartile range is the range of the middle 50% of the data.
• The IQR is a measure of variability that is not influenced by outliers or
extreme values.
• Measures like Q1, Q3, and IQR that are not influenced by outliers are called
resistant measures.
• It is defined as the difference between the third quartile and the first quartile.

IQR = Q3 – Q1
EXAMPLE 2.10
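The table below lists the total revenue for the top 12 tourism companies in Malaysia.
109.7 79.9 74.1 121.2 76.4 80.2 82.1 79.4 89.3 98.0 103.5 86.8
Determine the interquartile range of the data.

Ordered array:
74.1 76.4 79.4 79.9 80.2 82.1 86.8 89.3 98.0 103.5 109.7 121.2

Answer: Q1 = 79.65, Q3 = 100.75, IQR = 21.1. The positions n/4 = 3 and 3n/4 = 9 are whole numbers, so each quartile is the average of two adjacent values. A method that interpolates at positions (n + 1)/4 and 3(n + 1)/4 gives 79.5, 102.1 and 22.6 instead.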
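The answer 79.65, 100.75, 21.1 follows from treating Q1 and Q3 as the 25th and 75th percentiles under the percentile location rule. The sketch below uses a hypothetical helper for that rule.

```python
import math


def quartile(data, k):
    """kth quartile (k = 1 or 3) using the percentile location rule L = k * n / 4."""
    ordered = sorted(data)
    location = k * len(ordered) / 4
    if location == int(location):
        j = int(location)
        # Whole number: average the values in positions L and L + 1.
        return (ordered[j - 1] + ordered[j]) / 2
    return ordered[math.ceil(location) - 1]


revenue = [109.7, 79.9, 74.1, 121.2, 76.4, 80.2,
           82.1, 79.4, 89.3, 98.0, 103.5, 86.8]

q1 = quartile(revenue, 1)    # average of the 3rd and 4th ordered values
q3 = quartile(revenue, 3)    # average of the 9th and 10th ordered values
iqr = q3 - q1
print(round(q1, 2), round(q3, 2), round(iqr, 2))
```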

2.6.3 VARIANCE
• Although the range is easy to compute, it is not often used in practice. The reason is
that the range involves only two values from the data set; the largest and smallest.
• The measures of spread that are most often used are the variance and the standard
deviation, which use every value in the data set.
• When a data set has a small amount of spread, like the San Francisco temperatures,
most of the values will be close to the mean. When a data set has a larger amount of
spread, more of the data values will be far from the mean.
• The variance is a measure of how far the values in a data set are from the mean, on
the average.
• The variance is computed slightly differently for populations and samples.

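Population

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample
• In the formula, the mean μ is replaced by the sample mean $\bar{x}$ and the denominator is n – 1 instead of N. The sample variance is denoted by $s^2$.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$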
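Sample Variance

Ungrouped:
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$

Grouped (x is the class midpoint, f the class frequency, and n = Σf):
$$s^2 = \frac{\sum f x^2 - \frac{(\sum f x)^2}{n}}{n - 1}$$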
EXAMPLE 2.11
A company that manufactures batteries is testing a new type of battery designed for laptop
computers. They measure the lifetimes, in hours, of six batteries, and the results are presented in
the following table. Find the variance of the lifetimes. (2)

Battery Lifetime 3 4 6 5 4 2
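The stated answer of 2 can be checked directly from the definition of the sample variance. The sketch below also compares the result with `statistics.variance` from the Python standard library, which uses the same n – 1 denominator.

```python
import statistics

lifetimes = [3, 4, 6, 5, 4, 2]   # battery lifetimes in hours

n = len(lifetimes)
mean = sum(lifetimes) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in lifetimes]
sample_variance = sum(squared_deviations) / (n - 1)

print(mean, sample_variance)
print(statistics.variance(lifetimes))   # standard-library cross-check
```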

EXAMPLE 2.12
No. of text      No. of students    Class midpoint,     f·x          f·x²
messages sent    (frequency, f)     x
0 – 49                10                24.5            245.0        6002.50
50 – 99                5                74.5            372.5       27751.25
100 – 149             13               124.5           1618.5      201503.25
150 – 199             11               174.5           1919.5      334952.75
200 – 249              7               224.5           1571.5      352801.75
250 – 299              4               274.5           1098.0      301401.00
Total                 50                               6825.0     1224412.50
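The column totals in this table feed a grouped sample variance. The sketch below assumes the computational form s² = (Σf·x² − (Σf·x)²/n) / (n − 1) with n = Σf, which is consistent with the f·x and f·x² columns but is not stated in the surviving slide text.

```python
import math

# (lower limit, upper limit, frequency) for the text-message data.
classes = [(0, 49, 10), (50, 99, 5), (100, 149, 13),
           (150, 199, 11), (200, 249, 7), (250, 299, 4)]

n = sum(f for _, _, f in classes)
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
freqs = [f for _, _, f in classes]

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sum_fx2 = sum(f * x ** 2 for f, x in zip(freqs, midpoints))

# Assumed computational formula for the grouped sample variance.
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

print(n, sum_fx, sum_fx2)
print(round(variance, 2), round(std_dev, 2))
```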
 

2.6.4 STANDARD DEVIATION
• Because the variance is computed using squared deviations, the units of the variance
are the squared units of the data.
• For example, in Battery Lifetime example, the units of the data are hours, and the
units of variance are squared hours.
• In most situations, it is better to use a measure of spread that has the same units as the
data.
• We do this simply by taking the square root of the variance. This quantity is called
the standard deviation.
• The standard deviation of a sample is denoted by s, and the standard deviation of a population is denoted by σ:

$$s = \sqrt{s^2}, \qquad \sigma = \sqrt{\sigma^2}$$
Important properties of standard
deviation
• The standard deviation is a measure of variation of all values from the mean.
• The value of the standard deviation is usually positive (it is never negative).
• The value of the standard deviation can increase dramatically with the
inclusion of one or more outliers (data values far away from all others).
• The units of the standard deviation are the same as the units of the original
data values.

Comparing Standard Deviations

Smaller standard deviation

Larger standard deviation


Summary Characteristics
▪ The more the data are spread out, the greater the range, variance, and standard
deviation.

▪ The more the data are concentrated, the smaller the range, variance, and
standard deviation.

▪ If the values are all the same (no variation), all these measures will be zero.

▪ None of these measures are ever negative.


2.6.5 THE COEFFICIENT OF
VARIATION
• Measures relative variation.
• Always in percentage (%).
• Shows variation relative to the mean:

$$CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$$

• Can be used to compare the variability of two or more sets of data measured in different units.
EXAMPLE 2.13 Comparing Coefficients of Variation
• Stock A:
– Mean price last year = $50.
– Standard deviation = $5.
• Stock B:
– Mean price last year = $100.
– Standard deviation = $5.

Both stocks have the same standard deviation, but stock B is less variable relative to its mean price.

Comparing Coefficients of Variation (cont’d)
• Stock A:
– Mean price last year = $50.
– Standard deviation = $5.
• Stock C:
– Mean price last year = $8.
– Standard deviation = $2.

Stock C has a much smaller standard deviation but a much higher coefficient of variation.
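The comparison can be made concrete by computing each coefficient of variation as the standard deviation divided by the mean, times 100. The function and variable names below are illustrative.

```python
def coefficient_of_variation(mean, std_dev):
    """Coefficient of variation in percent: (standard deviation / mean) * 100."""
    return 100 * std_dev / mean


# (mean price last year, standard deviation) in dollars, from Example 2.13.
stocks = {"A": (50, 5), "B": (100, 5), "C": (8, 2)}

cv = {name: coefficient_of_variation(mean, sd) for name, (mean, sd) in stocks.items()}

for name, value in cv.items():
    print(f"Stock {name}: CV = {value:.0f}%")
```

Stock B has half the relative variation of stock A despite the same standard deviation, and stock C has the highest relative variation despite the smallest standard deviation.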
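Conclusions: Measures of Dispersion

Measurement            Ungrouped data                          Grouped data
Range                  Xlargest – Xsmallest                    Upper boundary of last class – lower boundary of first class
Interquartile range    IQR = Q3 – Q1                           IQR = Q3 – Q1
Variance               $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$    $s^2 = \frac{\sum f x^2 - \frac{(\sum f x)^2}{n}}{n - 1}$
Standard deviation     $s = \sqrt{s^2}$                        $s = \sqrt{s^2}$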
2.7 MEASURE OF
SKEWNESS/SHAPE
• Describes how data are distributed.
• Two useful shape related statistics are:
– Skewness:
– Measures the extent to which data values are not symmetrical.
– Kurtosis:
– Kurtosis measures the peakedness of the curve of the distribution—that
is, how sharply the curve rises approaching the center of the
distribution.
2.7.1 COEFFICIENT OF SKEWNESS
• To determine the skewness of the data:
– If the value is positive → right skewed
– If the value is negative → left skewed
– If the value is 0 → symmetric
• Measures the extent to which data are not symmetrical.

                      Left-Skewed      Symmetric        Right-Skewed
Mean vs. median       Mean < Median    Mean = Median    Median < Mean
Skewness statistic    < 0              0                > 0
2.7.2 KURTOSIS
Measures how sharply the curve rises approaching the center of the distribution

Sharper Peak
Than Bell-Shaped
(Kurtosis > 0)

Bell-Shaped
(Kurtosis = 0)
Flatter Than
Bell-Shaped
(Kurtosis < 0)
The Five Number Summary
The five numbers that help describe the center, spread and shape of data are:
▪ Xlargest.
▪ Third Quartile (Q3).
▪ Median (Q2).
▪ First Quartile (Q1).
▪ Xsmallest.
• This summary is more informative when it is displayed on a diagram drawn to scale.
• A graphic display that accomplishes this is known as a box-and-whiskers display (boxplot).
Five Number Summary and
The Boxplot
• The Boxplot: A Graphical display of the data based on the five-number
summary:

Xsmallest -- Q1 -- Median -- Q3 -- Xlargest


Example:

[Figure: boxplot marked Xsmallest, Q1, Median, Q3 and Xlargest; each of the four segments between consecutive markers contains 25% of the data.]

Calculating The Interquartile Range

Example:

X minimum    Q1    Median (Q2)    Q3    X maximum
    12       30        45         57        70

Each of the four segments contains 25% of the data.

Interquartile range = 57 – 30 = 27
Five Number Summary:
Shape of Boxplots
• If data are symmetric around the median then the box and central line are centered between the endpoints.

Xsmallest Q1 Median Q3 Xlargest

• A boxplot can be shown in either a vertical or horizontal orientation.
Distribution Shape and
The Boxplot

Left-Skewed      Symmetric      Right-Skewed
Q1 Q2 Q3         Q1 Q2 Q3       Q1 Q2 Q3
Chapter Summary
In this chapter we covered:
• Organizing categorical variables.
• Organizing numerical variables.
• Visualizing categorical variables.
• Visualizing numerical variables.
• Describing the properties of central tendency, variation, and shape in
numerical variables.
• Constructing and interpreting a boxplot.
