Chapter 2
DESCRIPTIVE DATA
SQQS1013 ELEMENTARY
STATISTICS
ORGANIZING AND
VISUALIZING DATA
Objectives
In this chapter you will learn about:
• Organizing categorical variables.
• Organizing numerical variables.
• Visualizing categorical variables.
• Visualizing numerical variables.
• Organizing and visualizing a mix of variables.
• The challenge in organizing and visualizing variables.
2.1 INTRODUCTION
Example
Here is a list of questions asked in a large statistics class and the data value
given by one of the students:
i. What is your sex (m=male, f=female)? m
ii. How many hours did you sleep last night? 5 hours
iii. Randomly pick a letter, S or Q. S
iv. What is your height in inches? 67 inches
v. What’s the fastest you’ve ever driven a car (mph)? 110 mph
Raw data - Data recorded in the sequence in which
they were originally collected, before being
processed or ranked.
• Visual summaries enable rapid review of larger amounts of data & show
possible significant patterns.
Tallying Data
• One categorical variable: summary table.
• Two categorical variables: contingency table.
• A summary table tallies the frequencies or percentages of items in a set of
categories so that you can see differences between categories.
• A contingency table is used to study patterns that may exist between the
responses of two or more categorical variables.
– For two variables, the tallies for one variable are located in the rows and the tallies for the
second variable are located in the columns.
Example 2.1: Contingency Table
• A random sample of 400 invoices is drawn.
• Each invoice is categorized as a small, medium, or large amount.
• Each invoice is also examined to identify if there are any errors.
• These data are then organized in the contingency table below.
Table 2.4 Contingency Table Showing the Frequency of Invoices Categorized
by Size and the Presence of Errors
                 No Errors   Errors   Total
Small Amount        170         20      190
Medium Amount       100         40      140
Large Amount         65          5       70
Total               335         65      400
Contingency Table Based on Percentage of Overall Total
Each count in Table 2.4 is divided by the overall total of 400 invoices, for example:
42.50% = 170 / 400
25.00% = 100 / 400
16.25% = 65 / 400
                 No Errors   Errors   Total
Small Amount       42.50%     5.00%   47.50%
Medium Amount      25.00%    10.00%   35.00%
Large Amount       16.25%     1.25%   17.50%
Total              83.75%    16.25%   100.0%
83.75% of sampled invoices have no errors and 47.50% of sampled invoices are for
small amounts.
Contingency Table Based on Percentage of Row Totals
Each count is divided by the total of its own row, for example:
89.47% = 170 / 190
71.43% = 100 / 140
92.86% = 65 / 70
                 No Errors   Errors   Total
Small Amount       89.47%    10.53%   100.0%
Medium Amount      71.43%    28.57%   100.0%
Large Amount       92.86%     7.14%   100.0%
Total              83.75%    16.25%   100.0%
Medium invoices have a larger chance (28.57%) of having errors than small (10.53%) or
large (7.14%) invoices.
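The two percentage tables can be reproduced from the counts of Example 2.1. The following is a minimal sketch; the names `counts`, `percent`, `overall` and `by_row` are illustrative and do not come from the slides.

```python
# Counts from Table 2.4: each row is (no errors, errors) for one invoice size.
counts = {
    "Small": (170, 20),
    "Medium": (100, 40),
    "Large": (65, 5),
}

def percent(part, whole):
    """Return part / whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

grand_total = sum(sum(row) for row in counts.values())  # 400 invoices

# Percentage of overall total: every cell is divided by the grand total.
overall = {size: tuple(percent(c, grand_total) for c in row)
           for size, row in counts.items()}

# Percentage of row total: every cell is divided by its own row total.
by_row = {size: tuple(percent(c, sum(row)) for c in row)
          for size, row in counts.items()}

for size in counts:
    print(size, overall[size], by_row[size])
```

Dividing by the grand total answers "what share of all invoices falls in this cell?", while dividing by the row total answers "within this invoice size, what share has errors?".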
Visualizing Data
• Summary table for one variable.
• Contingency table for two variables.
Source: Data extracted from A. Bhalla, “Don’t Misuse the Pareto Principle,” Six Sigma
Forum Magazine, May 2009, pp. 15–18.
The Pareto Chart (con’t)
[Figure: Pareto chart highlighting the “vital few” categories]
Multiple (Side By Side) Bar Charts
▪ The side by side bar chart represents the data from a contingency table.
Contingency table based on percentage of column totals:
                 No Errors   Errors   Total
Small Amount       50.75%    30.77%   47.50%
Medium Amount      29.85%    61.54%   35.00%
Large Amount       19.40%     7.69%   17.50%
Total              100.0%    100.0%   100.0%
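The percentages plotted in the side by side bar chart are column percentages: each count is divided by the total of its column (335 invoices without errors, 65 with errors). A short sketch, again with illustrative names, checks the table against the counts of Example 2.1.

```python
# Counts from Example 2.1: each row is (no errors, errors) for one invoice size.
counts = {"Small": (170, 20), "Medium": (100, 40), "Large": (65, 5)}

# zip(*rows) transposes the rows so that each column can be summed.
col_totals = [sum(col) for col in zip(*counts.values())]

# Percentage of column total: every cell is divided by its own column total.
by_column = {
    size: tuple(round(100 * c / t, 2) for c, t in zip(row, col_totals))
    for size, row in counts.items()
}

for size, row in by_column.items():
    print(size, row)
```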
2.3 PRESENTATION OF QUANTITATIVE DATA
2.3.1 Organizing Quantitative Data
Numerical data are organized into a frequency distribution by:
i. determining the number of classes, c (c may be rounded up or rounded down);
ii. determining a suitable width of a class, i, and establishing the boundaries of each class to
avoid overlapping (i must always be rounded up);
iii. choosing the starting point of the 1st class
=> use the smallest value in the data set.
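The steps above can be sketched in code. The formulas for the number of classes and the class width are not recoverable from the text here, so this sketch assumes a Sturges-type rule, c = 1 + 3.3 log10(n), and a width just larger than range / c. Both are assumptions, chosen because they lead to the five classes of width 10 used in Example 2.2. The function name `class_setup` is hypothetical.

```python
import math

def class_setup(data):
    """Suggest a number of classes, a class width and a starting point."""
    n = len(data)
    # ASSUMPTION: Sturges-type rule; the rule used on the slide is not shown.
    c = round(1 + 3.3 * math.log10(n))          # c may be rounded up or down
    # ASSUMPTION: the width must exceed range / c, so take the next whole
    # number above it (always rounding up, never down).
    width = math.floor((max(data) - min(data)) / c) + 1
    start = min(data)                           # 1st class starts at the minimum
    return c, width, start

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
print(class_setup(temps))  # (5, 10, 12)
```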
Example 2.2
Frequency Distribution Example
A manufacturer of insulation randomly selects 20 winter
days and records the daily high temperature.
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
Solution Example 2.2
Data in Ordered Array
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class      Frequency
12 - 21        4
22 - 31        6
32 - 41        5
42 - 51        3
52 - 61        2
Total         20
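The tally can be checked by counting how many temperatures fall within each pair of class limits. This is a minimal sketch with a hypothetical helper, `frequency_distribution`, that assumes integer data and inclusive class limits such as 12 - 21.

```python
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

def frequency_distribution(data, start, width, n_classes):
    """Tally integer data into consecutive classes of equal width."""
    table = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1               # class limits are inclusive
        freq = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, freq))
    return table

for lower, upper, freq in frequency_distribution(temps, 12, 10, 5):
    print(f"{lower} - {upper}: {freq}")
```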
Solution Example 2.2 (con’t)
Class      Frequency   Relative Frequency   Percentage
12 - 21        4             0.20               20%
22 - 31        6             0.30               30%
32 - 41        5             0.25               25%
42 - 51        3             0.15               15%
52 - 61        2             0.10               10%

Class      Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
12 - 21        4          20%                4                     20%
22 - 31        6          30%               10                     50%
32 - 41        5          25%               15                     75%
42 - 51        3          15%               18                     90%
52 - 61        2          10%               20                    100%
Total         20         100%
Cumulative Percentage = (Cumulative Frequency / Total) * 100 ; (10/20) * 100 = 50%
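The percentage and cumulative columns follow mechanically from the class frequencies. A small sketch, with illustrative variable names, builds them with a running total:

```python
from itertools import accumulate

classes = ["12 - 21", "22 - 31", "32 - 41", "42 - 51", "52 - 61"]
freqs = [4, 6, 5, 3, 2]
total = sum(freqs)                                       # 20 days

percentages = [100 * f / total for f in freqs]
cum_freqs = list(accumulate(freqs))                      # running total of frequencies
cum_percentages = [100 * cf / total for cf in cum_freqs]

for row in zip(classes, freqs, percentages, cum_freqs, cum_percentages):
    print(*row)
```

The second row, for example, gives a cumulative frequency of 10 and a cumulative percentage of (10/20) * 100 = 50%.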
Why Use a Frequency Distribution?
• It condenses the raw data into a more useful form.
• It allows for a quick visual interpretation of the data.
• It enables the determination of the major characteristics of the data set including where the
data are concentrated/clustered.
Visualizing numerical data:
• Ordered array.
• Stem-and-leaf display.
• Frequency distributions and cumulative distributions: histogram, polygon, ogive.
Stem-and-Leaf Display
A simple way to see how the data are distributed and where concentrations of
data exist.
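As an illustration, a stem-and-leaf display for the temperature data of Example 2.2 can be sketched as follows. The choice of the tens digit as the stem and the units digit as the leaf is an assumption that suits two-digit data, and `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

def stem_and_leaf(data):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    leaves = defaultdict(list)
    for x in sorted(data):
        leaves[x // 10].append(x % 10)
    return dict(leaves)

display = stem_and_leaf(temps)
for stem in sorted(display):
    print(f"{stem} | {' '.join(str(leaf) for leaf in display[stem])}")
```

The longest row of leaves shows where the data are concentrated, here in the twenties.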
The Histogram
▪ The class boundaries (or class midpoints) are shown on the horizontal axis.
▪ The height of the bars represents the frequency, relative frequency, or percentage.
Class      Frequency   Relative Frequency   Percentage
12 - 21        3             0.15               15
22 - 31        6             0.30               30
32 - 41        5             0.25               25
42 - 51        4             0.20               20
52 - 61        2             0.10               10
Ogive
2.3.3 Visualizing Two Numerical Variables
• Two numerical variables can be visualized with a scatter plot or a time-series plot.
The Scatter Plot
▪ Scatter plots are used for numerical data consisting of paired observations
taken from two numerical variables.
▪ One variable is measured on the vertical axis and the other variable is
measured on the horizontal axis.
▪ The shape is the pattern of the distribution of values from the lowest value to
the highest value.
2.4 MEASURE OF CENTRAL TENDENCY
2.4.1 MEAN
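2.4.1.1 UNGROUPED DATA
For a sample of size n:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
where \bar{x} (pronounced x-bar) is the sample mean and x_i is the ith value.
[Figure: two dot plots on a scale from 11 to 20, one with Mean = 13 and one with Mean = 14]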
2.4.1.2 GROUPED DATA
\[ \bar{x} = \frac{\sum f x}{\sum f} \]
where x is the mid-point of a class, f is the frequency of a class, and \sum f is the total
frequency (the sample size).
EXAMPLE 2.3
a. During a semester, a student took five exams. The population of exam
scores is 78, 83, 92, 68, and 85. Find the mean. (406, 81.2)
41 53 58 67 33 61 43 45 42 67
39 48 36 47 34 59 57 54 65 69
63 42 60 48 66 30 30 46 52 49
c. The following table presents the daily high temperatures recorded by an insulation
manufacturer on 20 randomly selected winter days (refer to Example 2.2).
Approximate the mean of daily high temperature. (34.5)
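The answer to part (a) can be verified directly: the mean is the sum of the values divided by how many values there are. A minimal sketch with illustrative variable names:

```python
scores = [78, 83, 92, 68, 85]

total = sum(scores)            # 406
mean = total / len(scores)     # 406 / 5 = 81.2

print(total, mean)
```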
2.4.2 MEDIAN
[Figure: two dot plots on a scale from 11 to 20, both with Median = 13]
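For grouped data:
\[ \text{Median} = L_m + \left( \frac{\frac{n}{2} - F}{f_m} \right) i \]
where L_m is the lower boundary of the median class, n is the total frequency, F is the cumulative
frequency before the median class, f_m is the frequency of the median class, and i is the class width.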
EXAMPLE 2.4
a. During a semester, a student took five exams. The population of exam scores
is 78, 83, 92, 68, and 85. Find the median. (83)
b. One of the goals of medical research is to develop treatments that reduce the
time spent in recovery. Eight patients undergo a new surgical procedure, and
the number of days spent in recovery for each is as follows. Find the
median. (17)
c. The following table presents the daily high temperatures recorded by an insulation
manufacturer on 20 randomly selected winter days (refer to Example 2.2).
Approximate the median of daily high temperature. (33.5)
Total 20
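For ungrouped data the median is the middle value of the ordered data, or the average of the two middle values when the number of values is even. The sketch below applies this standard rule to part (a); the function name `median` is illustrative.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

scores = [78, 83, 92, 68, 85]
print(median(scores))  # ordered: 68 78 83 85 92, so the median is 83
```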
2.4.3 MODE
2.4.3.1 UNGROUPED DATA
• Value that occurs most often.
• Not affected by extreme values.
• Used for either numerical or categorical data.
• There may be no mode.
• There may be several modes.
[Figure: two dot plots, one on a scale from 0 to 14 with Mode = 9 and one on a scale from 0 to 6 with no mode]
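2.4.3.2 GROUPED DATA
• Determine the class mode (or modal class): the class with the highest frequency.
• Use the following formula:
\[ \text{Mode} = L_{mo} + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) i \]
where L_{mo} is the lower boundary of the class mode, i is the class width, \Delta_1 is the
difference between the frequency of the class mode and the frequency of the class before
the class mode, and \Delta_2 is the difference between the frequency of the class mode and the
frequency of the class after the class mode.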
Approximating mode using histogram
[Figure: histogram of the no. of text messages with class boundaries -0.5, 49.5, 99.5, 149.5, 199.5, 249.5 and 299.5; Mode = 140]
EXAMPLE 2.5
a. Ten students were asked how many siblings they had. The results, arranged
in order, were
0 1 1 1 1 2 2 3 3 6
Find the mode of this data set. (1)
b. The following table presents the daily high temperatures recorded by an insulation
manufacturer on 20 randomly selected winter days (refer to Example 2.2).
Approximate the mode of daily high temperature. (29.0)
Class Frequency
12 - 21 3
22 - 31 6 Class Mode → highest freq.
32 - 41 5
42 - 51 4
52 - 61 2
Total 20
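The grouped approximations quoted in Examples 2.3(c), 2.4(c) and 2.5(b) can be reproduced from the class table directly above. The sketch assumes class boundaries 0.5 below each lower class limit (21.5, 31.5 and so on) and a class width of 10; the function names are illustrative, not part of the course material.

```python
# (lower limit, upper limit, frequency) for each class, as tabulated above.
table = [(12, 21, 3), (22, 31, 6), (32, 41, 5), (42, 51, 4), (52, 61, 2)]
WIDTH = 10  # distance between consecutive lower limits

def grouped_mean(table):
    """Sum of frequency x class midpoint, divided by the total frequency."""
    n = sum(f for _, _, f in table)
    return sum(f * (lo + hi) / 2 for lo, hi, f in table) / n

def grouped_median(table, width):
    """Lower boundary of the median class plus a share of the class width."""
    n = sum(f for _, _, f in table)
    cumulative = 0                      # cumulative frequency before the class
    for lo, _, f in table:
        if cumulative + f >= n / 2:     # first class that reaches n / 2
            return (lo - 0.5) + (n / 2 - cumulative) / f * width
        cumulative += f

def grouped_mode(table, width):
    """Lower boundary of the modal class plus d1 / (d1 + d2) of the width."""
    freqs = [f for _, _, f in table]
    k = freqs.index(max(freqs))         # modal class: highest frequency
    before = freqs[k - 1] if k > 0 else 0
    after = freqs[k + 1] if k < len(freqs) - 1 else 0
    d1, d2 = freqs[k] - before, freqs[k] - after
    return (table[k][0] - 0.5) + d1 / (d1 + d2) * width

print(grouped_mean(table), grouped_median(table, WIDTH), grouped_mode(table, WIDTH))
```

With these frequencies the three functions return 34.5, 33.5 and 29.0, which agree with the answers given in brackets in the examples.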
Which Measure to Choose?
▪ The mean is generally used, unless extreme values (outliers) exist.
▪ The median is often used, since the median is not sensitive to extreme values.
For example, median home prices may be reported for a region; it is less
sensitive to outliers.
▪ In many situations it makes sense to report both the mean and the median.
Describing the Shape of a Data Set
• The mean and median measure the center of a data set in different ways.
When a data set is symmetric, the mean, median and mode are equal.
• When a data set is skewed to the right, there are large values in the right tail.
Because the median is resistant while the mean is not, the mean is generally
more affected by these large values. Therefore, for a data set that is skewed to
the right, the mean is often greater than the median, which is in turn greater than the mode.
• Similarly, when a data set is skewed to the left, the mean is often less than the
median, which is in turn less than the mode.
i. Approximately Symmetric
Shape: Approximately symmetric
Relationship between the mean, median and mode: mean, median and mode are approximately the same.
ii. Skewed to the Right
Shape: Skewed to the right
Relationship between the mean, median and mode: mean is noticeably greater than the median,
which is greater than the mode.
iii. Skewed to the Left
Shape: Skewed to the left
Relationship between the mean, median and mode: mean is noticeably less than the median,
which is less than the mode.
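Summary of Measure of Central Tendency
Measure   Ungrouped data                        Grouped data
Mean      \bar{x} = \frac{\sum x}{n}            \bar{x} = \frac{\sum f x}{\sum f}
Median    middle value of the ordered data      L_m + \left( \frac{n/2 - F}{f_m} \right) i
(x: data value or class mid-point; f: class frequency; L_m: lower boundary of the median class;
F: cumulative frequency before the median class; f_m: frequency of the median class; i: class width)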
2.5 MEASURE OF POSITION
Measures of position are techniques that divide a set of data into equal groups.
To determine the measurement of position, the data must be sorted from lowest to highest. The different
measures of position are percentiles and quartiles
2.5.1 PERCENTILES
• The mean and median of a data set describe the center of a distribution
(quantitative).
• For some data it is often useful to compute measures of positions other than
the center, to get a more detailed description of the distribution.
• Percentiles provide a way to do this. Percentiles divide a data set into
hundredths.
• Definition: For a number p between 1 and 99, the pth percentile separates the
lowest p% of the data from the highest (100 – p)%.
2.5.1 PERCENTILES
UNGROUPED DATA
• First, the data need to be arranged in increasing order.
• Next, compute the location L = (p/100) × n, where n is the number of values in the data set.
– If L is a whole number, then the pth percentile is the average of the number in position L and the number in position (L+1).
– If L is not a whole number, round it up to the next higher whole number. The pth percentile is the number in the position corresponding to the
rounded-up value.
– Round the result to the nearest whole number.
EXAMPLE 2.6
A teacher gives a 20-points test to 10 students. The scores are shown here.
18 15 12 6 8 2 3 5 20 10
1. Find the value corresponding to the 25th and 60th percentile (5, 11).
2 3 5 6 8 10 12 15 18 20
2. Find the percentile rank of a score of 6 and 12 (35, 65).
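Both parts of Example 2.6 can be checked with a short sketch. It takes the location as L = (p/100) × n and the percentile rank of a score as (number of values below the score + 0.5) / n × 100, rounded to the nearest whole number. Both formulas are assumptions here, adopted because they reproduce the stated answers (5, 11) and (35, 65); the function names are hypothetical.

```python
scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

def percentile(data, p):
    """pth percentile (p a whole number from 1 to 99) using the locator L = (p/100) * n."""
    ordered = sorted(data)
    # Integer arithmetic avoids floating-point surprises when testing
    # whether L is a whole number.
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # L is whole: average the values in positions L and L + 1 (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round up, i.e. take position whole + 1 (1-based).
    return ordered[whole]

def percentile_rank(data, x):
    """ASSUMED formula: (values below x + 0.5) / n * 100, rounded."""
    below = sum(v < x for v in data)
    return round(100 * (below + 0.5) / len(data))

print(percentile(scores, 25), percentile(scores, 60))        # 5 11.0
print(percentile_rank(scores, 6), percentile_rank(scores, 12))  # 35 65
```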
2.5.2 QUARTILES
• There are 3 percentiles that are used more often than the others - the 25th, the
50th, and the 75th .
• These percentiles divide the data into 4 parts, each of which contains
approximately one quarter of the data.
• Thus, these 3 percentiles are called quartiles.
• You can visualize the distribution of the values for a numerical variable by:
– Computing the quartiles.
– Computing the five-number summary.
– Constructing a boxplot.
2.5.2 QUARTILE MEASURES
2.5.2.1 UNGROUPED DATA
• Quartiles split the ranked data into 4 segments with an equal number of values
per segment.
[Figure: ranked data split into four equal segments by Q1, Q2 and Q3]
■ The first quartile, Q1, is the value for which 25% of the
values are smaller and 75% are larger.
■ Q2 is the same as the median (50% of the values are
smaller and 50% are larger).
■ The third quartile, Q3, separates the lowest 75% of the data from the highest 25%:
only 25% of the values are greater than Q3.
2.5.2.2 GROUPED DATA
EXAMPLE 2.7
• Following are final exam scores, arranged in increasing order for 28 students.
58 59 62 64 67 68 69 71 73 74 74 75 76 76
76 77 78 78 78 82 82 84 86 87 87 88 91 97
EXAMPLE 2.8
The following table presents the daily high temperatures recorded by an insulation
manufacturer on 20 randomly selected winter days (refer to Example 2.2). Calculate
Q1 and Q3.
2.6 MEASURE OF DISPERSION
Example:
[Figure: dot plot of values on a scale from 0 to 14]
Range = 13 - 1 = 12
2.6.1 THE RANGE
2.6.1.2 GROUPED DATA
Class       Frequency
41 – 50         1
51 – 60         3
61 – 70         7
71 – 80        13
81 – 90        10
91 – 100        6
Total          40
Upper bound of last class = 100.5
Lower bound of first class = 40.5
Range = 100.5 – 40.5 = 60
Why The Range Can Be Misleading
▪ Does not account for how the data are distributed.
[Figure: two dot plots on a scale from 7 to 12 with different distributions; both have Range = 12 - 7 = 5]
▪ Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
EXAMPLE 2.9
The following table presents the average monthly temperature, in degrees Fahrenheit, for the
cities of San Francisco and St. Louis. Compute the range for each city.
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
San 51 54 55 56 58 60 60 61 63 62 58 52
Francisco
St. Louis 30 35 44 57 66 75 79 78 70 59 45 35
Source: National Weather Service
Solution:
The range for San Francisco is 63 – 51 = 12.
The range for St. Louis is 79 – 30 = 49.
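The range of each city in Example 2.9 is the largest monthly value minus the smallest. A minimal sketch, with an illustrative helper name `value_range`:

```python
# Average monthly temperatures (degrees Fahrenheit), January to December.
san_francisco = [51, 54, 55, 56, 58, 60, 60, 61, 63, 62, 58, 52]
st_louis = [30, 35, 44, 57, 66, 75, 79, 78, 70, 59, 45, 35]

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

print(value_range(san_francisco))  # 63 - 51 = 12
print(value_range(st_louis))       # 79 - 30 = 49
```

The much larger range for St. Louis reflects its greater spread of temperatures over the year.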
2.6.2 INTERQUARTILE RANGE
IQR = Q3 – Q1
EXAMPLE 2.10
The table below lists the total revenue for the top 12 tourism companies in Malaysia.
109.7 79.9 74.1 121.2 76.4 80.2 82.1 79.4 89.3 98.0 103.5 86.8
Determine the interquartile range of the data. (79.5, 102.1, 22.6)
74.1 76.4 79.4 79.9 80.2 82.1 86.8 89.3 98.0 103.5 109.7 121.2
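The stated answers to Example 2.10 can be reproduced by locating the kth quartile at position k(n + 1)/4 in the ordered data and interpolating linearly between neighbouring values. This positional rule is an assumption, since the quartile formula itself does not appear in the text; it is used here because it yields 79.5, 102.1 and 22.6 after rounding. The function name `quartile` is hypothetical.

```python
revenue = [109.7, 79.9, 74.1, 121.2, 76.4, 80.2,
           82.1, 79.4, 89.3, 98.0, 103.5, 86.8]

def quartile(data, k):
    """kth quartile at 1-based position k(n + 1)/4 with linear interpolation."""
    ordered = sorted(data)
    position = k * (len(ordered) + 1) / 4
    lower = int(position)                 # whole part of the position
    fraction = position - lower           # fractional part of the position
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

q1 = quartile(revenue, 1)    # position 3.25: between 79.4 and 79.9
q3 = quartile(revenue, 3)    # position 9.75: between 98.0 and 103.5
iqr = q3 - q1

print(round(q1, 1), round(q3, 1), round(iqr, 1))
```

Other quartile conventions exist and can give slightly different values for small data sets.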
2.6.3 VARIANCE
• Although the range is easy to compute, it is not often used in practice. The reason is
that the range involves only two values from the data set; the largest and smallest.
• The measures of spread that are most often used are the variance and the standard
deviation, which use every value in the data set.
• When a data set has a small amount of spread, like the San Francisco temperatures,
most of the values will be close to the mean. When a data set has a larger amount of
spread, more of the data values will be far from the mean.
• The variance is a measure of how far the values in a data set are from the mean, on
the average.
• The variance is computed slightly differently for populations and samples.
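Population:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
Sample:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
• In the sample formula, the mean μ is replaced by the sample mean \bar{x} and the
denominator is n – 1 instead of N. The sample variance is denoted by s².
Sample Variance
Ungrouped:
\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \]
Grouped (x is the class midpoint and f is the class frequency):
\[ s^2 = \frac{\sum f x^2 - \frac{(\sum f x)^2}{\sum f}}{\sum f - 1} \]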
EXAMPLE 2.11
A company that manufactures batteries is testing a new type of battery designed for laptop
computers. They measure the lifetimes, in hours, of six batteries, and the results are presented in
the following table. Find the variance of the lifetimes. (2)
Battery Lifetime 3 4 6 5 4 2
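The answer to Example 2.11 can be checked by summing the squared deviations from the sample mean and dividing by n – 1. A minimal sketch with an illustrative function name:

```python
lifetimes = [3, 4, 6, 5, 4, 2]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n                           # 24 / 6 = 4
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)         # 10 / 5

print(sample_variance(lifetimes))  # 2.0
```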
EXAMPLE 2.12
No. of text      No. of students    Class midpoint,   f⋅x        f⋅x²
messages sent    (frequency, f)     x
0 – 49                10                24.5           245.0       6002.50
50 – 99                5                74.5           372.5      27751.25
100 – 149             13               124.5          1618.5     201503.25
150 – 199             11               174.5          1919.5     334952.75
200 – 249              7               224.5          1571.5     352801.75
250 – 299              4               274.5          1098.0     301401.00
Total                                                 6825       1224412.5
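The column totals in the table, and the sample variance they lead to, can be reproduced as follows. The sketch assumes the grouped computational form, s² = (Σf⋅x² – (Σf⋅x)² / Σf) / (Σf – 1), which is inferred from the columns of the table rather than quoted from the slides.

```python
# (frequency f, class midpoint x) for each class in the table above.
table = [(10, 24.5), (5, 74.5), (13, 124.5),
         (11, 174.5), (7, 224.5), (4, 274.5)]

n = sum(f for f, _ in table)                   # total frequency
sum_fx = sum(f * x for f, x in table)          # total of the f.x column
sum_fx2 = sum(f * x * x for f, x in table)     # total of the f.x^2 column

# ASSUMED grouped sample variance (computational form).
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

print(n, sum_fx, sum_fx2, round(variance, 2))
```

The totals 6825 and 1224412.5 match the last row of the table, and the resulting variance is about 5975.51 squared messages.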
2.6.4 STANDARD DEVIATION
• Because the variance is computed using squared deviations, the units of the variance
are the squared units of the data.
• For example, in the battery lifetime example (Example 2.11), the units of the data are hours, and the
units of variance are squared hours.
• In most situations, it is better to use a measure of spread that has the same units as the
data.
• We do this simply by taking the square root of the variance. This quantity is called
the standard deviation.
• The standard deviation of a sample is denoted s, and the standard deviation of a
population is denoted by σ.
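For the battery lifetimes of Example 2.11, the sample standard deviation is the square root of the sample variance. This short sketch uses Python's standard `statistics` module, whose `variance` and `stdev` functions both use the n – 1 denominator.

```python
import statistics

lifetimes = [3, 4, 6, 5, 4, 2]                # hours

variance = statistics.variance(lifetimes)     # sample variance, in squared hours
std_dev = statistics.stdev(lifetimes)         # sample standard deviation, in hours

print(variance, round(std_dev, 3))
```

The variance is 2 squared hours, so the standard deviation is the square root of 2, roughly 1.414 hours.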
Important properties of standard deviation
• The standard deviation is a measure of variation of all values from the mean.
• The value of the standard deviation is usually positive (it is never negative).
• The value of the standard deviation can increase dramatically with the
inclusion of one or more outliers (data values far away from all others).
• The units of the standard deviation are the same as the units of the original
data values.
Comparing Standard Deviations
▪ The more the data are concentrated, the smaller the range, variance, and
standard deviation.
▪ If the values are all the same (no variation), all these measures will be zero.
• Stock C:
– Mean price last year = $8.
– Standard deviation = $2.
Stock C has a much smaller standard deviation but a much higher coefficient of
variation.
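Conclusions: Measures of Dispersion
Measurement           Ungrouped data                     Grouped data
Range                 largest value – smallest value     upper boundary of last class – lower boundary of first class
Interquartile range   IQR = Q3 – Q1                      IQR = Q3 – Q1
Variance              s² = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1}      s² = \frac{\sum f x^2 - (\sum f x)^2 / \sum f}{\sum f - 1}
Standard deviation    s = \sqrt{s^2}                     s = \sqrt{s^2}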
2.7 MEASURE OF SKEWNESS/SHAPE
• Describes how data are distributed.
• Two useful shape related statistics are:
– Skewness: measures the extent to which data values are not symmetrical.
– Kurtosis: measures the peakedness of the curve of the distribution, that
is, how sharply the curve rises approaching the center of the distribution.
2.7.1 COEFFICIENT OF SKEWNESS
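• To determine the skewness of the data:
– Skewness statistic < 0: skewed to the left.
– Skewness statistic = 0: symmetric.
– Skewness statistic > 0: skewed to the right.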
2.7.2 KURTOSIS
Measures how sharply the curve rises approaching the center of the distribution:
– Sharper peak than bell-shaped (Kurtosis > 0).
– Bell-shaped (Kurtosis = 0).
– Flatter than bell-shaped (Kurtosis < 0).
The Five Number Summary
The five numbers that help describe the center, spread and shape of data are:
▪ Xlargest.
▪ Third Quartile (Q3).
▪ Median (Q2).
▪ First Quartile (Q1).
▪ Xsmallest.
• This summary is more informative when it is displayed on a diagram drawn to
scale.
• A graphic display that accomplishes this is known as a box-and-whisker display
(boxplot).
Five Number Summary and The Boxplot
• The Boxplot: a graphical display of the data based on the five-number
summary.
Example:
[Figure: boxplot marking X minimum, Q1, Median (Q2), Q3 and X maximum]
Interquartile range = 57 – 30 = 27
Shape of Boxplots
• If data are symmetric around the median then the box and central
line are centered between the endpoints.
[Figure: three boxplots, each marking Q1, Q2 and Q3]
Chapter Summary
In this chapter we covered:
• Organizing categorical variables.
• Organizing numerical variables.
• Visualizing categorical variables.
• Visualizing numerical variables.
• Describing the properties of central tendency, variation, and shape in
numerical variables.
• Constructing and interpreting a boxplot.