Chapter 3


Chapter 3 Summary – Displaying and Describing Quantitative Data

In chapter 2, you learned how to display and describe qualitative data, and you learned that there
are only a few ways of describing that type of data. When we have quantitative data, there are several
ways of converting them into information. In this chapter, we are concerned with ways of describing a
single quantitative variable, while in chapter 4 we will look at ways of describing relationships between
two quantitative variables.
In this chapter, we will look at the following ways of displaying and describing a single quantitative
variable, not all of which are included in the text but for which you will still be responsible. The ways of
displaying and describing can be broken down into the following categories:
- displaying and describing the overall distribution of the data by:
  - displaying with:
    - histograms
    - stem-and-leaf displays
  - describing with:
    - modes
    - symmetry
    - outliers
    - centre (mean and median)
    - spread (range, interquartile range, variance, standard deviation)
    - five-number summary and boxplots
  - comparing groups
  - identifying outliers
  - standardizing
- time series plots with time-series data

In addition to what is in the text, we will look at other ways of describing quantitative variables when:

- it is necessary to weight our data to calculate meaningful statistics
- we don't have the original data, we only have data grouped into intervals

or, with:

- polygons
- ogives

displaying with histograms:

To display a quantitative variable with a histogram we would need to take a rather large number of
observations to be able to reach any conclusions about the overall shape of the data. Once we have
these data, we will create a number of mutually exclusive and collectively exhaustive intervals of equal
widths* and count how many observations fall into each of these intervals. (* If we have extreme values
in our data, we may create an open-ended interval or two and count those extremely small values as
belonging to one of these intervals and those extremely large values as belonging to the other of these
intervals, creating two intervals whose widths are different from the other intervals). Although there
are some rules of thumb which can be used to determine the number of intervals to use, there is no one
rule. No matter how many intervals we eventually use, we want to see somewhat of a smooth
distribution and have enough intervals to get a decent picture of what the distribution looks like. To
some extent, the number of intervals that will give us this smooth and decent picture depends on the
number of observations (the greater the number of observations, the greater the number of intervals
we can use).

The following frequency table was created from a data set of 191 observations of annual incomes of a
random sample of individuals:

Annual Income ($1000's)    No. of Individuals
<= 0                         0
0 to 20                      3
20 to 40                    15
40 to 60                    45
60 to 80                    65
80 to 100                   45
100 to 120                  15
120 to 140                   3
>= 140                       0
Totals                     191

Drawing a histogram, with bars whose heights represent the number of individuals falling into each of
the intervals, we end up with the following picture of the income distribution of this sample:

[Histogram: "Income Distribution" – x-axis: Income ($1000's), intervals <= 0 through >= 140; y-axis: No. of Respondents, 0 to 70]

What should we look at when describing this picture of the income distribution of these 191 sampled
individuals? We should first look at the shape of this distribution. We can begin to describe the
shape of a histogram by its number of peaks (i.e., modes) and its symmetry. In this case, there is one
peak or mode and the distribution is symmetrical (i.e., one half of the histogram is a mirror image of the
other half). Although this picture conveys perfect symmetry, having the actual data may indicate that
the distribution is less-than-perfectly symmetrical. (Another histogram may give us a picture that is not
perfectly symmetrical but close enough to being symmetrical that we will call it symmetrical.) In either
case, as long as the data indicate near symmetry, we will assume the distribution to be symmetrical.
If the histogram's picture does not indicate anything close to being symmetrical, we may call the
distribution skewed – either right or positively skewed (if the distribution tails off to the right or toward
the larger values), or, left or negatively skewed (if the distribution tails off to the left or toward the
smaller values). If we look at the spreadsheet 'histograms' in the Excel file, 1000hist,polygon,ogive, we will
see 5 histograms in total, including the above histogram. (How we describe each histogram is indicated
above each of these histograms.)

What else can we tell about this distribution? Because it is symmetrical, we can tell that the average
income of this group is somewhere between 60 and 80 ($1000's) and the median is also in that same
range. Although we cannot say without some calculations, the distribution of incomes appears to be
somewhat spread out.

Eventually, we will see how we can estimate some statistics based on this histogram or table without
knowing the actual values.
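The counting step described above can be sketched in code. A minimal sketch in Python: the interval edges match the income table above, but the `incomes` values below are hypothetical (they are not the 191 actual observations), and the intervals are treated as half-open.

```python
# A minimal sketch of the binning step: count how many observations fall
# into each of a set of mutually exclusive, collectively exhaustive intervals.
# The edges match the income table above; the `incomes` data are made up
# for illustration only.

def frequency_table(values, edges):
    """Count values into half-open intervals [edges[i], edges[i+1]).

    Values below the first edge go into an open-ended bottom interval and
    values at or above the last edge into an open-ended top interval.
    """
    counts = [0] * (len(edges) + 1)          # one extra open-ended bin at each end
    for v in values:
        if v < edges[0]:
            counts[0] += 1                   # open-ended bottom bin
        elif v >= edges[-1]:
            counts[-1] += 1                  # open-ended top bin
        else:
            for i in range(len(edges) - 1):
                if edges[i] <= v < edges[i + 1]:
                    counts[i + 1] += 1
                    break
    return counts

edges = [0, 20, 40, 60, 80, 100, 120, 140]           # in $1000's
incomes = [15, 35, 55, 65, 75, 85, 105, 125, 70]     # hypothetical sample
print(frequency_table(incomes, edges))
```

Each entry of the returned list is the height of one histogram bar.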

polygons & ogives

Polygons and ogives are two other graphs for depicting the distribution of a quantitative variable.
Polygons give us the same information as histograms except they use lines instead of bars as a means of
describing the shapes of distributions and they allow us to display the shapes of more than one
distribution in the same graph. The following polygon was created using the frequency table in the Excel
file 1000hist,polygon,ogive. This polygon gives us the distribution of incomes for each of the four
groups of families in this file, including the one group discussed above.

[Polygon: "Income Distributions ($1000's) by Group" – x-axis: Income ($1000's), 0 to 140; y-axis: No. of Households, 0 to 90; one line for each of Groups A, B, C and D]

By examining each of the above series of lines, we note that: group A’s income distribution is unimodal
and symmetrical; B’s distribution is unimodal and right-skewed; C’s is unimodal and left-skewed and D’s
is bimodal and symmetrical.

The ogive gives us the number (or percentages) of observations whose values are less than values which
appear along the horizontal axis. The below ogive gives us the cumulative relative frequencies for each
of the four groups of families in the file.
[Ogive: "Cumulative Income Distributions ($1000's)" – x-axis: Income ($1000's), 0 to 140; y-axis: Cumulative % of Households, 0% to 100%; one line for each of Groups A, B, C and D]

From this graph, we can, for example, determine what approximate percentage of the observations for
group A have values less than 50. We can do so by drawing a vertical line at 50 until it touches the blue
line and then drawing a horizontal line, from where it touches, until it reaches the vertical axis. We can
see, by doing so, that approximately 21% of the values are less than or equal to 50. (We can also
use this ogive to estimate the median. Since the median is that value which separates the bottom 50%
of the values from the top 50% of the values, we draw a horizontal line at 50% until it touches the blue
line. Where it touches, we draw a vertical line down until it reaches the horizontal axis. We can see, by
doing so, that the median is approximately 70.)

Stem-and-Leaf Display

A stem-and-leaf display, if created carefully, can also give us some picture of a data set’s distribution
while still allowing us to see the original values. For the following example, we have the following
grades for a sample of 50 students in a very large accounting class:

41 71 52 60 62 57 50 69 77 82 65 61 43 58 62 68 61 59 65 61 55 70 53 48 58 78 63 61 51
61 59 87 73 42 52 54 84 64 58 68 50 95 52 55 40 51 59 50 54 74

To create a stem-and-leaf display, we first divide each mark into two parts, with the more important
part being the stem (e.g., the more important part of the mark of 73 is the 7) and the other part of the
mark being the leaf (e.g., the less important part of the mark of 73 is the 3). We list the various
values of the stems down the left-hand column and we list the values of the leaves across the
row for that particular stem value. Completing this exercise gives us the
following stem-and-leaf display.

4 13820
5 27089538192480251904
6 029512815131148
7 170834
8 274
9 5
Notes:

The blue outline is not really part of the stem-and-leaf display but it does give us some indication of the
distribution of marks if we use equal spacing when we enter the values of the leafs. This stem-and-leaf
display indicates that these marks are right-skewed because the tail of the distribution is to the right or
where the higher values are.

While the text orders the values within each stem, it is not necessary to do so although some
information may be more difficult to determine (e.g., the median of the data set).
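The construction just described can be sketched in Python; as in the display above (and unlike the text's version), leaves are kept in the order encountered rather than sorted:

```python
# A minimal sketch of building a stem-and-leaf display for two-digit marks:
# the stem is the tens digit and the leaf is the units digit.

def stem_and_leaf(marks):
    stems = {}
    for m in marks:
        stems.setdefault(m // 10, []).append(m % 10)   # group leaves by stem
    lines = []
    for stem in sorted(stems):
        lines.append(f"{stem} " + "".join(str(leaf) for leaf in stems[stem]))
    return lines

marks = [41, 71, 52, 60, 62, 57, 50, 69, 77, 82, 65, 61, 43, 58, 62, 68, 61,
         59, 65, 61, 55, 70, 53, 48, 58, 78, 63, 61, 51, 61, 59, 87, 73, 42,
         52, 54, 84, 64, 58, 68, 50, 95, 52, 55, 40, 51, 59, 50, 54, 74]
for line in stem_and_leaf(marks):
    print(line)
```

Sorting each list of leaves before joining would reproduce the text's ordered version.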

The following histogram for this same set of data also indicates a right-skewed distribution.

[Histogram: "Distribution of Marks" – x-axis: marks, intervals 40 to 49 through 90 to 100; y-axis: no. of students, 0 to 20]

Although the stem-and-leaf display is not as flexible in creating intervals as is the histogram, the
stem-and-leaf display has some flexibility. For example, we can break up each of the above stems into two
separate stems. For example, we can break up the 40 – 49 grades into two stems: a stem allowing for
the grades between 40 and 44 and a stem allowing for the grades between 45 and 49. (Each stem must
be created so that the stems are of equal 'width'. In this example, the 40 – 44 stem allows for five digits
(0, 1, 2, 3, 4) and the 45 – 49 stem allows for five digits (5, 6, 7, 8, 9).)

Using this revised stem-and-leaf display, we end up with:

4 1320
4 8
5 20312402104
5 789589859
6 0212113114
6 95858
7 1034
7 78
8 24
8 7
9
9 5

Observation:
Although this latter stem-and-leaf display still indicates a right-skewed distribution, its shape is not as
'smooth' as when we used fewer stems. (With only 50 observations spread over 12 intervals, we might
expect that.) Therefore, we would probably stay with our 6-'interval' stem-and-leaf display.

Means, Medians & Modes

We already used the mode when discussing the histogram. But there we used the mode to identify the peaks of
the distribution to determine whether there is one mode (unimodal), two modes (bimodal), or more than
two modes (multimodal). When determining the number of modes when the data are put into intervals,
we may not care whether the peaks are of the same height.

The actual definition of a mode is that value in our data set which occurs most often. This definition
may result in the data set having no mode (if each value occurs the same number of times); one mode
(if one value occurs more often than any other value); or, more than one mode (if more than one value
occurs the same number of times and occurs more often than any other value). In our marks example, since
the values are not ordered within each stem, finding the mode or modes is more tedious than if
they were ordered. Examining the data, it appears that 61 occurs the most, as it occurs 5 times.
Therefore, the mode = 61.

The median is defined as that value in the data set which separates the bottom half of the values from
the top half of the values. If the data are ordered from smallest to largest (or from largest to smallest),
the median occurs at observation (n + 1)/2. For example, if we have 45 observations ordered from
smallest to largest, the median is the value at observation (45 + 1)/2, or at the 23rd observation. In our
marks example, n = 50 and, thus, the median occurs at the (50 + 1)/2, or 25.5th, observation. Since
there is no such observation, we average the values of the 25th and 26th observations. Since our
values in the stem-and-leaf display are not completely ordered, once we determine in which stem or
stems the 25th and 26th observations fall, we need to do some ordering to find these two
observations. The 25th observation is the largest value in stem 5 and the 26th observation is the
smallest value in stem 6. These two values are 59 and 60, respectively, giving us a median of
(59 + 60)/2, or, the median = 59.5.

The mean (aka arithmetic mean) is defined as the arithmetic average of a variable's values in the data
set. If the data set is a sample data set, the symbol for the mean is ȳ (in this text), and if the data set is a
population data set, the symbol for the mean is μ or μy. Their respective equations are:

ȳ = (∑ yi) / n    and    μ = (∑ yi) / N

where n = sample size, and, N = population size


Obviously, it is more time consuming to calculate the mean mark than either the mode or the median
without the use of a computer. If the calculations are correct,

ȳ = (∑ yi) / n = 3043/50 = 60.86, or, the mean mark is 60.86

Five simple examples:

Ex. 1: 1 2 3 4 5 mean = 3 median = 3 mode = no mode


Ex. 2: 1 1 3 5 5 mean = 3 median = 3 mode = 1 & 5
Ex. 3: 1 3 3 3 5 mean = 3 median = 3 mode = 3
Ex. 4: 1 3 5 7 14 mean = 6 median = 5 mode = no mode
Ex. 5: 1 8 9 11 11 mean = 8 median = 9 mode = 11
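These three measures can be checked with a short sketch; the `modes` function follows the definitions above, returning an empty list when there is no mode:

```python
# Mean, median, and mode(s) for small data sets, matching the five examples.
from collections import Counter

def mean(data):
    return sum(data) / len(data)

def median(data):
    s = sorted(data)
    n = len(s)
    mid = n // 2
    # average the two middle values when n is even
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def modes(data):
    counts = Counter(data)
    if len(set(counts.values())) == 1 and len(counts) > 1:
        return []                      # every value occurs equally often: no mode
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(mean([1, 2, 3, 4, 5]), median([1, 2, 3, 4, 5]), modes([1, 2, 3, 4, 5]))  # 3.0 3 []
```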

Notes:
1. If the distribution is symmetrical: mean = median
2. If the distribution is symmetrical with one mode: mean = median = mode
3. If the distribution is right-skewed: mean > median
4. If the distribution is left-skewed: mean < median

(The mean is affected by extreme values because the mean uses all values in the data set. The median
only considers the middle values in the data set and is not affected by these extreme values. Therefore,
if we have a right-skewed distribution, the mean will be larger because of the larger values in the tail,
while the median will not be affected, causing the mean to be greater than the median. The opposite
will happen if the distribution is left-skewed.)

Outliers and Centre

As mentioned above, the mean is affected by values at the lower end and at the upper end of the values
in the data set while the median is not. If the lower values are offset by the upper values, the mean and
median may still have similar values. Values that are extremely small or extremely large are sometimes
referred to as ‘outliers’ and these extreme values tend to be in one direction. The mean can be greatly
affected by these values and result in the sample mean being a poor estimate of the population mean
especially when the sample size is small.

If a data set has an extreme value and one can justify eliminating that value from the data set, the
sample mean should become a better estimate of the population mean. For example, suppose we want
to estimate the average starting salaries of recent business graduates at the University of Windsor and
we took a random sample of 6 of these graduates and determined their salaries to be (in $1000’s):

45 42 52 38 47 and 3,500

By putting these values in order from smallest to largest:

38 42 45 47 52 3,500

The median starting salary of this sample is 46 ($1000's) and the mean is 620.7 ($1000's), quite a
difference! Obviously the salary of 3,500 ($1000's) is an outlier. But, what should we do about it? First
of all, we should try to figure out why this value was so different from the other values. If we can find a
rational explanation, we may be able to argue that this value should be eliminated from our sample
when determining the mean and median. For example, the starting salary of this individual had nothing
to do with him being a business graduate. It had to do with him being 7 feet tall and being an excellent
basketball player. This would be a legitimate reason for removing him from the sample, resulting in a
new median of 45 ($1000's) (quite similar to the original median) and a new mean of 44.8 ($1000's),
quite a difference from the original mean! So, we would probably use the sample mean as the estimate
of the population mean. If there is no legitimate reason for removing this salary from the data set, the
original sample median of 46 ($1000's) should be used as a better estimate of the population's 'centre'.
(The text argues, rightly or wrongly, that the median should be used as an estimate of average or centre
if the distribution is not symmetrical, with or without outliers in the data set.)

The text briefly discusses the harmonic mean and the geometric mean. (Only the geometric mean will
be discussed here and, thus, you will not be responsible for the harmonic mean.) The geometric mean is
used, for example, to calculate the average rate of growth where the growth rates are measured over
time as opposed to the average growth rate of 10 stocks in 2016.

Example

Suppose you invest $1000 in some stock. In the first year you earn a return of 100% and in the second
year your rate of return is −50%. What is your average rate of return over this two-year period? Using
the arithmetic mean, your average rate of return is (100 + (−50))/2 = 25%. Would you be happy with that
rate of return? How much money would your investment turn into after two years?

In the first year your original $1000 would turn into $2000 with your 100% rate of return. But, after the
second year, your original $1000 is again only worth $1000 with your -50% rate of return. So, actually
you earned nothing on your investment, making your arithmetic mean very deceptive. The geometric
mean is a true measure of the average rate of return in that it measures the average growth rate of
some variable over time. There are a couple of ways of calculating average growth rates when the
measures used to determine growth rates are dependent on each other.

One such formula is:

GM = (ending value / beginning value)^(1/n) − 1

where n = number of periods for which the average growth rate is being calculated.

In this example, using this formula:

GM = (1000/1000)^(1/2) − 1 = √1 − 1 = 1 − 1 = 0, or 0%

Another formula is:

GM = [(1 + R1)(1 + R2)...(1 + Rn)]^(1/n) − 1

where the R's are expressed as decimals and not percentages.

Using this formula for this example:

GM = [(1 + 1)(1 − .5)]^(1/2) − 1 = √1 − 1 = 0, or 0%
A less extreme example

You invested in a stock and sold it after 3 years. In the first year your rate of return was 20%. In the
second year it was 8% and in the third year it was – 4%, giving you an arithmetic average return of 8%.
The true average rate of return is:

GM = [(1 + .20)(1 + .08)(1 − .04)]^(1/3) − 1 = (1.24416)^(1/3) − 1 ≈ 1.07554 − 1 ≈ .07554, or 7.554%
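When the rates are expressed as decimals, the product formula reduces to a few lines of code; a minimal sketch, checked against both examples above:

```python
# Geometric-mean rate of return: multiply the growth factors (1 + R) and
# take the n-th root. Rates are decimals, e.g. 0.20 for 20%.

def geometric_mean_return(rates):
    product = 1.0
    for r in rates:
        product *= (1 + r)
    return product ** (1 / len(rates)) - 1

print(geometric_mean_return([1.00, -0.50]))        # the 100%, -50% case: 0.0
print(geometric_mean_return([0.20, 0.08, -0.04]))  # the three-year case, about 7.55%
```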

Weighted means and group means

Suppose a small business has 3 hourly employees. Fred makes $12.00 per hour, Sue makes $14.00 per
hour and Bob makes $16.00 per hour. What is the average hourly wage this business paid its 3 hourly
employees last week?

Using the arithmetic mean, the hourly wage would be calculated to be $14.00 per hour. But, there is a
problem with this average if some employees worked more hours than others. For example, if Fred
worked 40 hours while Sue worked 30 hours and Bob worked 20 hours, it makes sense to conclude that
the average wage would be less than $14.00 per hour because more hours paid $12.00 than $14.00 or
$16.00 per hour. To arrive at the actual average wage per hour, we should weight the different wages
by the number of hours worked at each different wage. Without first giving a formula for this weighted
average, the following table should make sense when determining this average.

Hourly wages    No. of hours    (Hourly wages)(No. of hours)
$12.00          40              $480.00
$14.00          30              $420.00
$16.00          20              $320.00
Totals          90              $1220.00

Therefore the average hourly wage should be the total wages divided by the total number of hours, or,
the average hourly wage is:

$1220.00 / 90 ≈ $13.56

If we were to create a formula for the weighted mean, it would be:

ȳw = (∑ yi wi) / (∑ wi)

where,
ȳw = weighted mean (e.g., weighted mean of wages per hour)
yi = each value of y, where y is the variable of interest (e.g., wages per hour)
wi = appropriate weight for yi (e.g., no. of hours worked)
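A minimal sketch of this formula, using the wages-and-hours example above:

```python
# Weighted mean: sum of (value * weight) divided by the sum of the weights.

def weighted_mean(values, weights):
    return sum(y * w for y, w in zip(values, weights)) / sum(weights)

wages = [12.00, 14.00, 16.00]
hours = [40, 30, 20]
avg = weighted_mean(wages, hours)   # $1220 / 90 hours, about $13.56
```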

When we no longer have the original values of y and only have the number of observations with y's
falling into each of the intervals, we need to estimate what the values are within each interval if we
want to estimate the mean of the data set; that is, we need to estimate the mean with grouped data.

Suppose we have the following table of the marks for the above sample of 50 students in the accounting
class:

Marks      No. of Students
40 – 49     5
50 – 59    20
60 – 69    15
70 – 79     6
80 – 89     3
90 – 99     1
Totals     50

If we want to estimate the values within each interval, two logical assumptions would be to assume that
the marks within each interval are evenly spread out within the interval, or, that all the marks within
each interval have the midpoint of that interval as their values. (Although the former assumption seems
more reasonable, using the midpoints of each interval results in the same answer.) With this
assumption, the formula for the mean with grouped data is similar to the weighted mean, with the
weights being the number of observations within each interval (the fj's) and the values of y being
replaced by the midpoints (the mj's). Duplicating the format of the table for the weighted mean
example:

Marks      Midpoints, mj    No. of Students, fj    mj fj
40 – 49    44.5              5                      222.5
50 – 59    54.5             20                     1090.0
60 – 69    64.5             15                      967.5
70 – 79    74.5              6                      447.0
80 – 89    84.5              3                      253.5
90 – 99    94.5              1                       94.5
Totals                      50                     3075.0

With these calculations, the estimated mean mark is:

3075.0 / 50 = 61.50

This value was essentially calculated using the formula:

ȳg = (∑ mj fj) / n    where n = ∑ fj
Observations:
1. This estimated mean is quite close to the previously calculated sample mean of 60.86.
2. The most that this estimated mean can differ from the actual sample mean is ± 4.5.
3. If the data were continuous data, or, if marks did not need to be integers, the intervals would be
stated differently (e.g., the 40 – 49 interval would be stated as 40 – 50, resulting in the midpoint
of this interval being 45 instead of 44.5, and the estimated average being 62.00)
4. With open-ended classes, there are, theoretically, no midpoints for these classes, requiring us to
either come up with reasonable midpoints for these classes or, simply, not estimate the mean.
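The midpoint-based estimate above can be sketched as:

```python
# Grouped-data mean: each observation in an interval is assumed to sit at
# that interval's midpoint, so the estimate is a weighted mean of midpoints.

def grouped_mean(midpoints, freqs):
    n = sum(freqs)
    return sum(m * f for m, f in zip(midpoints, freqs)) / n

midpoints = [44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
freqs     = [5, 20, 15, 6, 3, 1]
est = grouped_mean(midpoints, freqs)   # 3075 / 50 = 61.5
```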

Back to raw data – measures of variability (i.e., the spread of the distribution)

As a quick review of measures of centre (aka, averages), using the five examples below and their 5
pieces of data each, the mean, median and mode for each are easily determined as follows:

Ex. 1: 1 2 3 4 5 mean = 3 median = 3 mode = no mode


Ex. 2: 1 1 3 5 5 mean = 3 median = 3 mode = 1 & 5
Ex. 3: 1 3 3 3 5 mean = 3 median = 3 mode = 3
Ex. 4: 1 3 5 7 14 mean = 6 median = 5 mode = no mode
Ex. 5: 1 8 9 11 11 mean = 8 median = 9 mode = 11

Observations:
- Examples 1, 2 and 3 have symmetrical distributions. Therefore, their means equal their medians.
- Example 3 is also unimodal. Therefore, its mean, median and mode are all equal.
- Example 4 is right-skewed. Therefore, its mean is greater than its median.
- Example 5 is left-skewed. Therefore, its mean is less than its median.
- Data sets can have no modes, one mode, or, more than one mode.

While measures of 'centre' give us important information, measures of spread or variability also give us
important information. They measure how different the values are within a data set and, therefore, give
us some measure of uncertainty.

Without calculating a specific measure of spread, an educated guess as to the order of the above
examples, from least to most spread out, might look like:
Example 3 (least spread out because data are bunched in the middle of the data set)
Example 1 (data are ‘evenly’ spread out)
Example 2 (data are bunched at the ends of the data set)
Example 5 (bigger difference between the largest and smallest values in the data set)
Example 4 (most spread out as the difference between the largest and smallest values is biggest)

Observation:

Generally speaking,
- Data that are bunched in the middle of the data set tend to reduce our perception of the spread of the data set
- Data that are bunched at the ends of the data set tend to increase our perception of the spread of the data set
- A greater difference between the largest and smallest values in the data set tends to increase our perception of the spread of the data set
- But, our perceptions become less clear when we have conflicting evidence (e.g., data bunched in the middle of the data set while the difference between the largest and smallest values is great)

There are various measures of spread and some measures may support the above list and some may
not. The simplest of the measures of spread is the range, where the range is defined as follows:

Range = max – min or Range = largest value – smallest value

Using our 5 examples:

Ex. 1: 1 2 3 4 5 range = 5 – 1 = 4
Ex. 2: 1 1 3 5 5 range = 5 – 1 = 4
Ex. 3: 1 3 3 3 5 range = 5 – 1 = 4
Ex. 4: 1 3 5 7 14 range = 14 – 1 = 13
Ex. 5: 1 8 9 11 11 range = 11 – 1 = 10

These range values support our educated guesses for examples 4 and 5, but don't totally support our
educated guesses for the first 3 examples, as the range ignores all values other than the smallest and
largest values.

Another measure, both similar to and different from the range, is the interquartile range (IQR). Similar
because it only uses two values in the data set, and different because it only looks at the range of the
middle 50% of the data. To determine this range, we need to determine Q1 and Q3. Q1 (aka the first or
lower quartile) separates the bottom 25% of the values from the top 75% of the values, and Q3 (aka the
third quartile or upper quartile) separates the bottom 75% of the values from the top 25% of the values.
(For those who are curious, Q2 separates the bottom 50% of the values from the top 50%, or, Q2 is the
median.)

With these definitions of IQR, Q1 and Q3,

Interquartile range, IQR = Q3 – Q1

There is some disagreement as to how to determine the quartiles when one has only a few observations,
unless the number of observations is divisible by 4. For example, if our data set consists of the
following 8 values (already ordered):

2 6 8 12 17 21 24 25

Q1 = average of the 2nd and 3rd value, or, Q1 = 7


Q3 = average of the 6th and 7th value, or, Q3 = 22.5

and,

IQR = 22.5 – 7 = 15.5


(When we have many observations which are grouped into intervals, an ogive can be used to
approximate the quartiles and, thus, the IQR. This will be illustrated when we get back to grouped data.)
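A minimal sketch of the range and IQR, using the "median of each half" convention of the 8-value example above (quartile conventions differ between texts and software):

```python
# Range and interquartile range. Q1 is the median of the lower half of the
# sorted data and Q3 the median of the upper half (median excluded from
# both halves when n is odd).

def median(sorted_data):
    n = len(sorted_data)
    mid = n // 2
    return sorted_data[mid] if n % 2 else (sorted_data[mid - 1] + sorted_data[mid]) / 2

def range_and_iqr(data):
    s = sorted(data)
    n = len(s)
    lower = s[: n // 2]            # bottom half
    upper = s[(n + 1) // 2 :]      # top half
    q1, q3 = median(lower), median(upper)
    return s[-1] - s[0], q3 - q1

data = [2, 6, 8, 12, 17, 21, 24, 25]
rng, iqr = range_and_iqr(data)     # range = 23, IQR = 22.5 - 7 = 15.5
```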

So far, our two measures of spread were based on only two values. Most measures, whether they
measure ‘centre’ or ‘spread’, are better if they use all the values in the data set. Two such measures of
spread are the variance and standard deviation. Both measures are calculated by comparing each value
in the data set with the mean of that data set. As with the mean, there is a formula based on a sample’s
data set and a formula based on a population’s data set.

sample variance:  s² = ∑(yi − ȳ)² / (n − 1)

population variance:  σ² = ∑(yi − μ)² / N

sample standard deviation:  s = √[ ∑(yi − ȳ)² / (n − 1) ]

population standard deviation:  σ = √[ ∑(yi − μ)² / N ]
Observations:
- Standard deviations are simply the square roots of the variances
- Unlike the mean, the values of the standard deviations and variances, even if the sample and population contain exactly the same values, will differ because the sample measures divide the sums by the number of observations minus 1 while the population measures divide the sums by the number of observations
- The units of measurement for the standard deviations are the same as the units of measurement of the original data (e.g., if y is measured in $'s, the standard deviation is also measured in $'s), while the units of measurement for the variances are the square of the units of measurement of the original data (e.g., if y is measured in $'s, the variance is measured in $²)

Getting back to our 5 examples, calculating the variances and standard deviations is easier if we list the
data in column form and create two additional columns as follows:

Ex. 1 (ȳ = 3)

y      y − ȳ    (y − ȳ)²
1      -2        4
2      -1        1
3       0        0
4       1        1
5       2        4
15      0       10

s² = ∑(yi − ȳ)² / (n − 1) = 10/4 = 2.5        s = √s² = √2.5 ≈ 1.58

Ex. 2 (ȳ = 3)

y      y − ȳ    (y − ȳ)²
1      -2        4
1      -2        4
3       0        0
5       2        4
5       2        4
15      0       16

s² = 16/4 = 4        s = √4 = 2

Ex. 3 (ȳ = 3)

y      y − ȳ    (y − ȳ)²
1      -2        4
3       0        0
3       0        0
3       0        0
5       2        4
15      0        8

s² = 8/4 = 2        s = √2 ≈ 1.41

Ex. 4 (ȳ = 6)

y      y − ȳ    (y − ȳ)²
1      -5       25
3      -3        9
5      -1        1
7       1        1
14      8       64
30      0      100

s² = ∑(yi − ȳ)² / (n − 1) = 100/4 = 25        s = √25 = 5

Ex. 5 (ȳ = 8)

y      y − ȳ    (y − ȳ)²
1      -7       49
8       0        0
9       1        1
11      3        9
11      3        9
40      0       68

s² = 68/4 = 17        s = √17 ≈ 4.12

Observations:
- Based on either the variance or the standard deviation, the ordering of the examples from least spread, or variability, to most spread matches the ordering based on our previous reasoning concerning a distribution's spread.
- The variance and standard deviation use all values in the data set while the range and interquartile range basically use two values in the data set.
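The sample formulas can be sketched and checked against the five examples:

```python
import math

# Sample variance and standard deviation: squared deviations from the mean,
# summed and divided by n - 1.

def sample_variance(data):
    ybar = sum(data) / len(data)
    return sum((y - ybar) ** 2 for y in data) / (len(data) - 1)

def sample_sd(data):
    return math.sqrt(sample_variance(data))

for ex in ([1, 2, 3, 4, 5], [1, 1, 3, 5, 5], [1, 3, 3, 3, 5],
           [1, 3, 5, 7, 14], [1, 8, 9, 11, 11]):
    print(ex, sample_variance(ex), round(sample_sd(ex), 2))
```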

Two other examples

Three previous perceptions were that: variability tends to be smaller the more the values are bunched in
the middle of the data set; variability tends to be larger the more the values are bunched at the ends of
the data set; and, variability tends to be larger the greater the range of the values in the data set. Let’s
test these perceptions with the following two sample data sets:

Ex. 6 (ȳ = 5)

y      y − ȳ    (y − ȳ)²
1      -4       16
5       0        0
5       0        0
5       0        0
5       0        0
9       4       16
30      0       32

Ex. 6: range = 8    variance = 32/5 = 6.4    standard deviation = √6.4 ≈ 2.53

Ex. 7 (ȳ = 5)

y      y − ȳ    (y − ȳ)²
2      -3        9
2      -3        9
4      -1        1
6       1        1
8       3        9
8       3        9
30      0       38

Ex. 7: range = 6    variance = 38/5 = 7.6    standard deviation = √7.6 ≈ 2.76

Observations:
- The range of values is larger in example 6 than in example 7, giving us the perception that there is a greater spread of values in example 6. (Comparing the ranges supports this perception.)
- The values in example 6 tend to be bunched in the centre of the data set while the values in example 7 tend to be bunched at the ends of the data set, giving us the perception that the spread is less in example 6. (Comparing the variances or standard deviations supports this perception.)
- Since the variance and standard deviation consider all values in the data set while the range considers only the smallest and largest values, one would usually look at the variance or standard deviation when comparing the spread or variability of data sets if there are conflicting results.

Measures of spread/variability with grouped data

Using the ogive, we can easily estimate the range and the interquartile range. The range would simply
be the difference between the upper boundary of the highest interval minus the lower boundary of the
lowest interval. (Obviously, with open-ended intervals, we would not be able to estimate the range.)
And, the IQR would be the difference between Q3 and Q1, where these values are estimated by looking
at the 75% and the 25% cumulative percentages.

As with being able to estimate the mean with grouped data, estimating the variance and standard
deviation with grouped data again requires some assumption. As with estimating the mean, we will
assume that each observation within a particular interval has a value equal to the midpoint of that
interval. (Note: If we were to assume that the values in each particular interval were equally spread out,
we would get slightly different and better estimates. But, the additional effort necessary to determine
these better estimates is not considered to be worth it, and this approach is not usually used.)

Using the formula for the variance with raw sample data,

s² = ∑(yi − ȳ)² / (n − 1)

we modify it to read,

s²g = ∑ fj (mj − ȳg)² / (n − 1)

(We changed the yi's to mj's and we multiplied each squared difference by fj to reflect the fact that each
squared difference is repeated fj times, once for each observation in interval j.)

Expanding on our previous table of marks, and the estimated mean of 61.5,

Marks      mⱼ     fⱼ    mⱼfⱼ     mⱼ − ȳg   (mⱼ − ȳg)²   fⱼ(mⱼ − ȳg)²
40 – 49   44.5     5    222.5      -17        289          1445
50 – 59   54.5    20   1090.0       -7         49           980
60 – 69   64.5    15    967.5        3          9           135
70 – 79   74.5     6    447.0       13        169          1014
80 – 89   84.5     3    253.5       23        529          1587
90 – 99   94.5     1     94.5       33       1089          1089
Total             50   3075.0                              6250

s²g = Σ fⱼ(mⱼ − ȳg)² / (n − 1) = 6250 / 49 ≈ 127.55   and   sg ≈ 11.29

Five-Number Summary and Box Plots

Another way of describing a quantitative variable's distribution, along with some specific statistics, is
through a boxplot (aka a box-and-whisker plot). This plot is created from 5 values in the data set,
plus knowledge of whether any values are considered 'outliers'. The 5 values are: the minimum value,
Q1, Q2, Q3 and the maximum value in our data set. (If there are 'outliers', we also need to know the
smallest and largest values that are not considered 'outliers'.)

What these 5 numbers do for us is break our data set down into 4 'ranges', each containing 25% of the
data. A 'picture' based on these ranges then allows us to get some idea of the variable's distribution
while at the same time recognizing the existence of 'outliers' as defined in the context of boxplots.

Before we create a boxplot from our students’ marks data set, we will first create 5 ‘perfect’ boxplots
and use them to describe the shapes of five ‘perfect’ distributions.

A uniform distribution

B unimodal symmetrical distribution

C bimodal symmetrical distribution


D right-skewed distribution

E left-skewed distribution

Why was the distribution of each of the above boxplots described as it was?

Each of the quarters of the boxplot contains 25% of the values. The narrower the quarter, the more
bunched up the data are in that quarter. This is equivalent to a histogram having higher bars in
intervals with more observations.

Using Excel and its calculations of quartiles, the five-number summary of our students’ marks:

Min = 40
Q1 = 52.25
Q2 = 59.50
Q3 = 67.25
Max = 95
And,
IQR = Q3 – Q1 = 67.25 – 52.25 = 15

Before we create our boxplot, we should first identify the outliers. In the context of boxplots, outliers
are defined as any values that are:

 More than 1.5 IQR’s less than Q1 (in this example, less than 52.25 – 1.5(15) = 29.75)
 More than 1.5 IQR’s greater than Q3 (in this example, greater than 67.25 +1.5(15) = 89.75)

Looking at our stem-and-leaf display for this data set, we have identified only one outlier – the mark of
95.

When outliers are present, they are omitted when drawing the 'whiskers' of the boxplot: the left
whisker does not start at the smallest outlier but at the smallest value that is not an outlier, and
the right whisker does not end at the largest outlier but at the largest value that is not an
outlier. Since there are no small outliers in this data set, the left whisker starts at the minimum
value of 40. Since there is one large outlier (95), the right whisker does not end at 95. Instead, it
ends at the largest value in our data set that is not an outlier, which is 87. The outlier is still part
of the box-and-whisker plot, but it is identified by a dot. The text mentions that if an outlier is
more than 3 IQR's less than Q1 or more than 3 IQR's greater than Q3, it will also be plotted but
using a different symbol (e.g., an *).
Starting the left whisker at 40 and ending the right whisker at 87, our box plot for this set of data looks
like:

This boxplot indicates a right-skewed distribution, though not as pronounced as the one suggested by
the stem-and-leaf display and histogram for this set of data.

Standardizing

Comparing an individual value of y with the average value, ȳ, in a sample data set tells us how
different that value is from its mean. But comparing these differences across two or more data
sets may not give us a true picture of how different the values are; indeed, such comparisons would be
useless if the units of measurement of the various data sets are not all the same. For example, how
can we compare how far a Canadian's hourly wage is from the average Canadian wage with how far an
American's hourly wage is from the average American wage when Canadian dollars differ from
American dollars? We can't, unless we convert both countries' wages to either Canadian or American
dollars, or we standardize the differences. We standardize by expressing each difference in
terms of how many standard deviations a Canadian's wage is from the average Canadian wage, and how
many standard deviations an American's wage is from the average American wage. The conversion to
standardized values (aka z-scores) can be formulated as:

zᵢ = (yᵢ − ȳ) / s   if we have a sample data set

zᵢ = (yᵢ − μ) / σ   if we have a population data set

Observation

Since the values in both the numerator and denominator are measured in the same units (in this
example, Canadian dollars in one sample and American dollars in the other), those units of
measurement cancel one another out when calculating z, allowing us to compare differences from the
means even when the values in the data sets are measured in different units (i.e., it allows us to
compare apples with oranges).

An example

Suppose there are 3 sections in an accounting course, each taught by a different professor, and the
professors' standards differ (i.e., one is generous with his/her grades, one is not as generous, and one
is stingy with his/her grades). And suppose a $100 gift card for the bookstore would be given to the one
student in this course with the best performance. Obviously, the only student in each section who would
be considered for this gift card is the student with the highest grade in his/her section. The three
highest marks were: Sue's mark of 97 in the section with the generous professor; Bob's mark of 85 in
the section with the professor who was not as generous; and Carol's mark of 72 in the stingy professor's
section. Based solely on the highest mark, Sue would get the gift card. But would that be fair to the
students whose professors marked less generously?

Suppose we had the following statistics for each of the three sections. (Although we are dealing with 3
populations and not 3 samples and should be using the symbols for the population calculations of z, we
will use the sample symbols as the text does not give us the formula for the population calculations of z.)

Section   Professor        Student's mark   Average mark   Standard deviation
1         generous               97              85                7
2         less generous          85              70               10
3         stingy                 72              55                6

Converting each of these 3 student marks to z:

Sue:    z = (97 − 85) / 7 ≈ 1.71

Bob:    z = (85 − 70) / 10 = 1.50

Carol:  z = (72 − 55) / 6 ≈ 2.83

Interpreting these 'scores': Sue's mark was approximately 1.71 standard deviations above the average
for her section; Bob's mark was 1.50 standard deviations above his class's average; and
Carol's mark was approximately 2.83 standard deviations above her class's average. As will become
more evident when we discuss the material in chapter 8, the further a z-score is from 0, the more
difficult it is to obtain for most distributions. Therefore, based on the z-scores of these 3
students, the $100 gift card should be given to Carol because the mark she obtained, even though it was
the lowest of the 3 marks, was the most difficult to obtain (in a positive direction).
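The three standardizations above can be captured by one small function. A minimal sketch:

```python
def z_score(y, mean, sd):
    """Standardize y: how many standard deviations y lies from the mean."""
    return (y - mean) / sd

sue = z_score(97, 85, 7)       # ~1.71
bob = z_score(85, 70, 10)      # 1.50
carol = z_score(72, 55, 6)     # ~2.83

# Carol has the largest z-score, so her mark was the hardest to obtain.
best = max([("Sue", sue), ("Bob", bob), ("Carol", carol)], key=lambda t: t[1])
```

The same function works whether mean and sd come from a sample (ȳ, s) or a population (μ, σ); only the inputs change.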

Time series plots

Although most of the statistics and graphs above could be created for both cross-sectional and time-series
data (although using the arithmetic mean as a measure of average growth rate is a problem), these
statistics and graphs omit an important aspect of time-series data: the pattern of the data over time.
The pattern over time is important in that it can indicate growth over time, a decline over time, a
cyclical pattern over time, a change in variability over time, no pattern over time, etc. The pattern that
has existed over time should give us important information about what went right or what went wrong with
previous decisions, and, if this pattern is expected to continue (not necessarily a valid assumption), this
information may be used when making decisions now and in the future.
The following line charts give us quarterly sales between 2011 and 2015 for various departments in an
electronics store, with the bottom chart also plotting total sales. (Note: When total sales is also plotted
on the same chart, we lose some sense of the variability of the individual department sales.)

[Line chart: Quarterly Sales, 2011 – 2015 — Sales ($M) by quarter (11-Q1 through 15-Q4) for phones, appliances, TV's and computers.]

From the above line chart, management may observe that: phone sales over time are quite variable
although there appears to be a positive trend; appliances sales over time do not appear to be growing,
with sales being influenced by the seasons in which the sales occurred; TV sales also do not appear to be
growing, with sales being cyclical with 3-year cycles; and, computer sales appear to be trending
downward in a linear fashion without much variability from quarter to quarter.

The chart below makes it more difficult to see some of the above patterns because of its vertical scale.
But it does allow management to see that total sales over this period were stagnant in that
there was no pattern of either an increase or a decrease.
[Line chart: Quarterly Sales (2011 – 2015) — Sales ($M) by quarter (11-Q1 through 15-Q4) for phones, appliances, TV's, computers and Total.]

Some closing observations:

1. Some statistical measures use all values in the data set (mean, variance, standard deviation,
weighted mean) while others (mode, median, quartiles, range, IQR, grouped mean, grouped
variance, grouped standard deviation) do not. If possible, it is usually preferable to use
measures which use all the data set's values, unless outliers may result in these measures not
being accurate estimates of the equivalent population measures.
2. Of the above measures, the variance is the only measure whose units are the data set’s original
units squared.
3. For all of the above measures except the range, adding another observation to an existing data
set will result in the measure either increasing, decreasing, or staying the same. The
value of the range can only increase or stay the same.
4. Adding a constant to each of the data set’s original values will increase (or decrease if the
constant is negative) the values of the original mean, median and mode by that constant but will
not affect the values of the range, IQR, variance or standard deviation.
5. Multiplying each of the data set’s original values by a positive constant will result in the values
of the mean, median, mode, range, IQR and standard deviation being the original values
multiplied by that constant. The new variance will become the value of the original variance
multiplied by that constant squared.
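Observations 4 and 5 can be verified numerically. A minimal sketch, using a small illustrative data set:

```python
import statistics as st

data = [2, 4, 4, 5, 10]   # illustrative values, not from the marks data
c = 3

shifted = [y + c for y in data]   # add a constant to every value
scaled = [y * c for y in data]    # multiply every value by a positive constant

# Adding a constant shifts the mean by c but leaves the spread unchanged.
shift_mean_ok = abs(st.mean(shifted) - (st.mean(data) + c)) < 1e-9
shift_sd_ok = abs(st.stdev(shifted) - st.stdev(data)) < 1e-9

# Multiplying scales the mean and SD by c, and the variance by c squared.
scale_mean_ok = abs(st.mean(scaled) - st.mean(data) * c) < 1e-9
scale_sd_ok = abs(st.stdev(scaled) - st.stdev(data) * c) < 1e-9
scale_var_ok = abs(st.variance(scaled) - st.variance(data) * c ** 2) < 1e-9
```

The same checks would pass for the median, mode, range and IQR under the shift, since every value, and hence every positional measure, moves by exactly c.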
6. When creating histograms, distorted 'pictures' of the data set's distribution would be created if
the widths of the histogram's intervals are not all the same (excluding open-ended classes).
7. Changing the scale of the vertical axis of a line chart can distort the pattern of a quantitative
variable over time in that it can either make the changes appear more important than they
really are by reducing the range of the scale, or, make them appear less important by increasing
the range of the vertical scale.