Topic 4.1: Measures of Variation: (A) Range
1: Measures of Variation
Our discussion of measures of central tendency in the previous unit does not provide all the
information you would like to know about your data. Besides knowing the central point around
which other values cluster (measures of central tendency), it is equally important to understand
how far the values in your dataset spread from that central point and how much they differ from
one another. That is, in addition to an estimate of the typical value, you also need to know how
widely the values are scattered around it.
To illustrate this point, suppose you have two datasets which show the ages of selected
individuals: (34, 35, 35, 37, 37, 38) and (13, 15, 17, 55, 56, 60). Both datasets have the same
average age of 36 (same central tendency), but you can easily tell that they differ in how the
values vary from one another. The second dataset reveals more variation than the first: the ages
in the first dataset are relatively uniform, while the ages in the second dataset are relatively
diverse.
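The comparison above can be checked with a short Python sketch. It uses the two age datasets from the text; the variable names are illustrative, and the gap between the largest and smallest value is used here only as a rough first indication of spread.

```python
from statistics import mean

# The two age datasets from the text.
first = [34, 35, 35, 37, 37, 38]
second = [13, 15, 17, 55, 56, 60]

# Central tendency: both datasets have the same average age.
mean_first = mean(first)
mean_second = mean(second)

# A rough indication of spread: largest value minus smallest value.
spread_first = max(first) - min(first)
spread_second = max(second) - min(second)

print("means:", mean_first, mean_second)
print("spreads:", spread_first, spread_second)
```

The means agree, yet the spreads differ sharply, which is why a measure of central tendency alone is not enough to describe a dataset.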
To fully understand your data, you need to measure how different or varied the values are: how
much they differ from each other and how close to, or far from, the mean they lie. The methods
that measure the spread of values around their central point are known as measures of variation
because they quantify how different, varied or dispersed the values in the dataset are. This unit
covers four such methods: the range, the inter-quartile range, the standard deviation and the
coefficient of variation.
(a) Range
The range is the difference between the highest value and the lowest value in a dataset.
Suppose the following data (28, 30, 33, 36, 45, 50, 58) represent the ages of seven employees. The
difference in age between the oldest employee (58 years) and the youngest employee (28 years) is
given by the range as (58 – 28) = 30 years. Of the various methods, the range is the easiest to
calculate. However, it has the disadvantage of considering only the highest and the lowest values,
which may be outliers, and ignoring every value in between. This makes the range very sensitive to
outliers and can lead to misleading conclusions about the variability in a dataset.
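A minimal Python sketch of the range calculation follows. The helper name `value_range` is illustrative, and the age of 85 is a hypothetical outlier added only to show how a single extreme value changes the result.

```python
def value_range(values):
    """Return the difference between the highest and lowest value."""
    return max(values) - min(values)

# Ages of the seven employees from the text.
ages = [28, 30, 33, 36, 45, 50, 58]
age_range = value_range(ages)

# Hypothetical scenario: the oldest employee is 85 instead of 58.
# Only one value changes, yet the range changes with it.
ages_with_outlier = ages[:-1] + [85]
outlier_range = value_range(ages_with_outlier)

print("range of original ages:", age_range)
print("range with one outlier:", outlier_range)
```

Six of the seven ages are identical in both lists, but the range nearly doubles, which illustrates its sensitivity to outliers.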
Age (x)   Mean   Deviation (x - mean)   Squared deviation
28        40     -12                    144
30        40     -10                    100
33        40     -7                     49
36        40     -4                     16
45        40     5                      25
50        40     10                     100
58        40     18                     324
Total = 758
The standard deviation is 11.24. We can interpret this to say that the ages typically vary by
about 11.24 years from the average age of 40 years. Thus, most ages lie within +/- 11.24 years of
the average age. In other words, most of the ages are between 28.76 and 51.24 years. As mentioned,
the standard deviation measures how far each data point lies from the mean: the higher the
standard deviation, the further the data points are from the mean and the more variation there is
in the dataset.
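The calculation behind these figures can be sketched in Python. The sketch assumes the sample formula, which divides the total squared deviation by n - 1; that assumption is consistent with the reported value, since 758 / 6 has a square root of about 11.24. Variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

# Ages of the seven employees.
ages = [28, 30, 33, 36, 45, 50, 58]
n = len(ages)

# Step 1: the mean age.
mean_age = sum(ages) / n

# Step 2: squared deviation of each age from the mean, then their total.
squared_devs = [(x - mean_age) ** 2 for x in ages]
total = sum(squared_devs)

# Step 3: sample standard deviation (assumes division by n - 1).
sample_sd = sqrt(total / (n - 1))

# The interval one standard deviation either side of the mean.
lower = mean_age - sample_sd
upper = mean_age + sample_sd

print("mean:", mean_age, "total squared deviation:", total)
print("standard deviation:", round(sample_sd, 2))
print("interval:", round(lower, 2), "to", round(upper, 2))
print("statistics.stdev agrees:", abs(sample_sd - stdev(ages)) < 1e-9)
```

The standard library's `statistics.stdev` uses the same n - 1 divisor, so it serves as a cross-check on the manual steps.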
The standard deviation is an improvement over the inter-quartile range because it considers all
the values in the dataset when measuring variation. However, it cannot be used to compare the
variation between two datasets measured in different units. For example, you cannot use the
standard deviation to compare variation in heights (measured in feet) with variation in weights
(measured in pounds), because each standard deviation is expressed in the units of its own
dataset.
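The coefficient of variation, listed earlier among the measures of variation, addresses this limitation by expressing the standard deviation as a percentage of the mean, which removes the units. The sketch below assumes the common definition (sample standard deviation divided by the mean, times 100). The height and weight values are hypothetical numbers invented for illustration, not data from the course.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

# Hypothetical illustration data (not from the course material).
heights_ft = [5.2, 5.5, 5.8, 6.0, 6.1]   # heights in feet
weights_lb = [130, 150, 165, 180, 210]   # weights in pounds

cv_heights = coefficient_of_variation(heights_ft)
cv_weights = coefficient_of_variation(weights_lb)

# Changing the unit (feet to inches) rescales the mean and the standard
# deviation by the same factor, so the coefficient stays the same.
heights_in = [h * 12 for h in heights_ft]
cv_heights_in = coefficient_of_variation(heights_in)

print("CV of heights (%):", round(cv_heights, 2))
print("CV of weights (%):", round(cv_weights, 2))
print("CV of heights in inches (%):", round(cv_heights_in, 2))
```

Because both results are percentages, they can be compared directly even though the underlying datasets use different units.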
Skewness
When data lack symmetry, the distribution is skewed in one direction, left or right. In terms of
skewness, a distribution takes one of three forms: left-skewed, right-skewed, or symmetric (as in
the normal distribution), which has no skew.
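One way to quantify the direction of skew is a skewness coefficient. The sketch below assumes the bias-adjusted sample formula, the same form documented for Excel's SKEW function; the function name `sample_skewness` is illustrative. A positive result suggests a right skew, a negative result a left skew, and a value near zero suggests symmetry. It reuses the seven employee ages from the range example.

```python
from math import sqrt

def sample_skewness(values):
    """Bias-adjusted sample skewness (assumed formula, see lead-in)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    s = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("all values are identical")
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

ages = [28, 30, 33, 36, 45, 50, 58]

skew_ages = sample_skewness(ages)                   # longer tail to the right
skew_symmetric = sample_skewness([1, 2, 3, 4, 5])   # symmetric data
skew_mirrored = sample_skewness([-x for x in ages]) # mirror image of ages

print("ages:", skew_ages)
print("symmetric:", skew_symmetric)
print("mirrored ages:", skew_mirrored)
```

Mirroring the data flips the sign of the coefficient, which matches the idea that left and right skew are mirror images of each other.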
Kurtosis
Kurtosis is another descriptive measure that can be used to describe the shape of a data
distribution. It measures the tail-ness and peaked-ness of a distribution relative to a normal
distribution, that is, the degree to which values are clustered in the tails and in the peak.
However, it places more emphasis on tail-ness than on peaked-ness. Kurtosis indicates whether the
tails of a data distribution match those of the normal distribution, that is, how heavy or light
the tails are. By quantifying the tail-ness of a distribution, kurtosis explores the presence and
frequency of outliers: it looks at how many extreme observations there are in the dataset and
whether the dataset has more or fewer extreme values than a normal distribution would. Modern
definitions emphasize that kurtosis is influenced more by extreme values (tails) than by values in
the center (peak) of the distribution, and a recent definition looks at how much of the variation
in the data is due to extreme values. We will use MS Excel to compute what is known as the excess
kurtosis coefficient to describe three forms of kurtosis: platykurtic, leptokurtic and mesokurtic
distributions.
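For readers who want to see the arithmetic outside a spreadsheet, here is a Python sketch of an excess kurtosis coefficient. It assumes the bias-adjusted sample formula documented for Excel's KURT function; the function names, the tolerance used to label a result as mesokurtic, and the two test datasets are illustrative choices, not part of the course material.

```python
from math import sqrt

def excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (assumed formula, see lead-in)."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least four values")
    mean = sum(values) / n
    s = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("all values are identical")
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

def classify(coefficient, tolerance=0.1):
    """Label a coefficient; the tolerance is an arbitrary illustrative choice."""
    if coefficient > tolerance:
        return "leptokurtic"   # heavier tails than a normal distribution
    if coefficient < -tolerance:
        return "platykurtic"   # lighter tails than a normal distribution
    return "mesokurtic"        # tails similar to a normal distribution

# Hypothetical datasets: evenly spread values versus mostly identical
# values with two extreme observations.
flat = list(range(1, 21))
heavy_tailed = [0] * 18 + [-10, 10]

kurt_flat = excess_kurtosis(flat)
kurt_heavy = excess_kurtosis(heavy_tailed)

print("evenly spread:", kurt_flat, classify(kurt_flat))
print("two extreme values:", kurt_heavy, classify(kurt_heavy))
```

The evenly spread data produce a negative coefficient and the data with two extreme values produce a positive one, in line with the emphasis on tails described above.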
Readings
Readings from course textbook: Chapter 3.1