
Topic 4.1: Measures of Variation
Our discussion of measures of central tendency in the previous unit does not provide all the
information you would like to know about your data. Besides knowing the central point around
which other values cluster (measures of central tendency), it is equally important to understand
how spread out the values are from that central point, or how the values vary or differ
from one another. That is, in addition to an estimate of the typical value, it is also
important for you to know how widely the values are scattered around that typical value.
To illustrate this point, suppose you have two datasets which show the ages of selected
individuals: (34, 35, 35, 37, 37, 38) and (13, 15, 17, 55, 56, 60). Both datasets have the same
average age of 36 (the same central tendency), but you can easily tell that the two datasets
differ in how their values vary from one another. The second dataset reveals more variation than
the first: the ages in the first dataset are relatively uniform, while the ages in the second
dataset are relatively diverse.
To fully understand your data, you need to measure how varied the values are. Your
interest is in quantifying this variation: how different the values are from each other,
how dispersed they are from the mean, and how close they are to the mean. The methods that
measure the spread of values around their central point are known as measures of variation
because they quantify how different, varied or dispersed the values in the dataset are.
There are four common methods of measuring variation in a dataset: the range, the
inter-quartile range, the standard deviation and the coefficient of variation.

(a) Range
The range refers to the difference between the highest value and the lowest value in a dataset.
Suppose the following data (28, 30, 33, 36, 45, 50, 58) represent the ages of seven employees.
The difference in age between the oldest employee (58 years) and the youngest employee (28 years)
is given by the range as (58 – 28) = 30 years. Of the various methods, the range is the easiest
and simplest to calculate; however, it has the disadvantage of considering only the highest and
lowest values (potential outliers) when describing variation. The range thus excludes all other
values, ignoring everything in between the lowest and highest points. This makes the range
very sensitive to the effect of outliers, which can lead to misleading conclusions about the
variability in a dataset.
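As a quick illustration, the range calculation above can be reproduced in a few lines of Python (a minimal sketch using the example ages):

```python
# Ages of the seven employees from the example above.
ages = [28, 30, 33, 36, 45, 50, 58]

# Range = highest value minus lowest value.
age_range = max(ages) - min(ages)

print(age_range)  # 30
```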

(b) Inter-Quartile Range


The inter-quartile range (IQR) refers to the difference between the third quartile and the first
quartile. The age dataset (28, 30, 33, 36, 45, 50, 58) has a first quartile of 30 and a third
quartile of 50. The inter-quartile range is given as (50 – 30) = 20 years. This shows the age
difference for the middle half of the data is 20 years; that is, the first and third quartiles
are separated by a distance of 20 years. Because the inter-quartile range covers only the middle
half of the data, it avoids the effect of outliers or extreme values. The inter-quartile range
therefore provides an improvement over the range, as it resolves the sensitivity to outliers that
is the main criticism levelled against the range. Although the IQR eliminates the effect of
outliers, it still carries the problem of not considering all values in the dataset. Similar to
the range, it only accounts for two values, Q1 and Q3, ignoring the rest of the values in the dataset.
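The IQR calculation can be sketched in Python. One point worth hedging: quartiles can be computed several ways, and the default "exclusive" method of `statistics.quantiles` happens to reproduce the Q1 = 30 and Q3 = 50 used in the example; other conventions can give slightly different quartiles.

```python
import statistics

ages = [28, 30, 33, 36, 45, 50, 58]

# quantiles(..., n=4) returns the three cut points [Q1, Q2, Q3].
# The default "exclusive" method matches the quartiles in the text.
q1, q2, q3 = statistics.quantiles(ages, n=4)

iqr = q3 - q1
print(q1, q3, iqr)  # 30.0 50.0 20.0
```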

(c) Standard Deviation


Standard deviation quantifies the amount of variation in a dataset. It measures the dispersion of
data from its mean, or how much a set of data points deviates from its own average. It is based on
the difference between each observation and the mean. The procedure for finding the standard
deviation is a bit involved, so a tabular approach is usually used for the calculation. Suppose you
are asked to find the standard deviation of the age dataset (28, 30, 33, 36, 45, 50, 58). The
procedure is laid out in the table below. The first column lists the values in the dataset. The
second column lists the average (280 / 7 = 40). The third column finds the difference between
each individual value and the average. The fourth column squares the difference obtained in the
third column. The next step in the procedure involves summing up the squared differences to
obtain a total of 758. Next, you divide the total by the number of observations minus one (n – 1),
as pertains to sample data. The result, 758 / (7 – 1) = 126.33, represents the variance. In the
final step, you find the square root of the variance to obtain the standard deviation:
SQRT(126.33) = 11.24.

X     Mean    X – Mean    (X – Mean)²
28    40      –12         144
30    40      –10         100
33    40      –7          49
36    40      –4          16
45    40      5           25
50    40      10          100
58    40      18          324
                Total  =  758

Variance = 758 / (7 – 1) = 126.33
Standard Deviation = √126.33 = 11.24

The standard deviation is 11.24. We can interpret this to say the ages vary by 11.24 years from
the average age of 40 years. Thus, most ages lie within +/- 11.24 years of the average age; in
other words, most of the ages are between 28.76 and 51.24 years. As mentioned, standard deviation
explores the variation of each data point from the mean: the higher the standard deviation, the
further the data points are from the mean, and the more variation there is in the dataset.
The standard deviation is an improvement over the inter-quartile range because it considers all
the values in the dataset when measuring variation. However, it is limited by the fact that it
cannot be used to compare the variation between two datasets measured in different units. For
example, you cannot use the standard deviation to compare variation in heights (measured in
feet) with variation in weights (measured in pounds). That is, when comparing variation between
two datasets, the standard deviation is of limited use when the datasets have different units of
measurement.
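The tabular procedure above maps directly onto a few lines of Python; note that `statistics.stdev` applies the same sample (n – 1) divisor used in the text:

```python
import math
import statistics

ages = [28, 30, 33, 36, 45, 50, 58]
mean = sum(ages) / len(ages)                        # 280 / 7 = 40.0

# Sum of squared differences from the mean (the table's final column total).
squared_diffs = sum((x - mean) ** 2 for x in ages)  # 758.0

variance = squared_diffs / (len(ages) - 1)          # 758 / 6 = 126.33
std_dev = math.sqrt(variance)                       # 11.24

# statistics.stdev uses the same sample (n - 1) formula.
print(round(std_dev, 2), round(statistics.stdev(ages), 2))  # 11.24 11.24
```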

(d) Coefficient of Variation


The coefficient of variation (CV) is preferred over the standard deviation when comparing two
datasets that are measured in different units or that have values of very different magnitudes.
The coefficient of variation is expressed as a percentage rather than in a unit of measurement.
It represents the ratio of the standard deviation to the mean, calculated as the standard
deviation divided by the mean, times 100. The formula is given as: (std / mean) x 100. It
measures, in percentage terms, how scattered or spread out the values are from the mean. In our
age dataset example, the standard deviation was 11.24 years and the mean was 40 years. The
coefficient of variation is given as (11.24 / 40) x 100 = 28.1%. We can interpret this to say
the ages vary by 28.1% from the average age. The larger the coefficient of variation, the more
spread out and dispersed the values are from the mean, and the wider the variation.
When comparing two or more datasets, it is proper to use the coefficient of variation because it
gives the comparison clearer meaning. This is because the coefficient of variation provides a
standard way, in the form of a percentage, to compare variation across multiple datasets
measured in different units, for example heights (in feet) and weights (in pounds). In this
case, the standard deviations for heights and weights are transformed into percentages so a
proper comparison can be made. Suppose in a class, the average height of students is 5.6 feet with a
standard deviation of 0.8 feet, and the average weight of students is 170 pounds with a standard
deviation of 20 pounds. You want to know which one varies more than the other - height or
weight? You cannot directly use feet and pounds to compare between height and weight because
they are in different units of measurement. To solve this problem, we need to transform feet and
pounds into percentage by converting the standard deviation to coefficient of variation. The
coefficient of variation for height is (0.8 / 5.6) x 100 = 14.29% while the coefficient of variation
for weight is (20 / 170) x 100 = 11.76%. Therefore, we can say that we have 14.29% variation in
heights as compared to 11.76% variation in weight. So, we have more variation in height than
weight.
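The height-versus-weight comparison can be reproduced with a small helper function (the function name is ours, chosen for illustration):

```python
def coefficient_of_variation(std_dev, mean):
    """CV = (standard deviation / mean) x 100, expressed as a percentage."""
    return std_dev / mean * 100

cv_height = coefficient_of_variation(0.8, 5.6)   # heights measured in feet
cv_weight = coefficient_of_variation(20, 170)    # weights measured in pounds

print(round(cv_height, 2))  # 14.29
print(round(cv_weight, 2))  # 11.76
# Height varies more (14.29%) than weight (11.76%).
```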
It is worth noting that the range, inter-quartile range, variance and standard deviation are
absolute measures of variation, while the coefficient of variation (CV) is a relative measure.
Hence, the coefficient of variation is also known as relative variability. The main limitation
of the coefficient of variation is that it cannot be calculated when the mean is zero, for
instance, when an average temperature is 0°C.
Readings
Readings from course textbook: Chapter 3.2

Topic 4.2: Measures of Shape


You have learned that measures of central tendency tell you about the center point of your data,
while measures of variation tell you how spread out the values are from that center point. The
next thing you may want to know about your data is its shape. The two methods that help you
understand and precisely measure the shape of your data are skewness and kurtosis. Skewness
measures the direction of outliers or extreme values in your data, while kurtosis measures the
magnitude of outliers or extreme values. In other words, skewness describes how far a
distribution departs from symmetry, while kurtosis describes how heavy the tails of a
distribution are relative to a normal distribution.

Skewness
When data lack symmetry, the distribution skews in one direction, left or right. There are three
cases to distinguish: a left-skewed distribution, a right-skewed distribution and a symmetrical
(normal) distribution.

(a) Left-Skewed Distribution


A left-skewed distribution, also known as a negatively skewed distribution, possesses the
following characteristics: (a) you will observe a long tail to the left of the distribution,
stretching towards the negative direction or left side of the graph, (b) the extreme values lie
to the left of the distribution, (c) the bulk of the data are clustered at the high end of the
distribution, (d) the mean is less than the median, (e) the bulk of the data are above the mean,
and (f) the skewness coefficient computed in MS Excel is less than – 0.5. The following dataset
(4, 80, 82, 88, 88, 90, 100) has a mean of 76, which is less than the median of 88, an indication
of a left-skewed distribution. Mathematically speaking, the mean minus the median yields a
negative result (76 – 88 = – 12), which explains why a left-skewed distribution is also known as
a negatively skewed distribution. You can compare the mean and median to crudely assess the
skewness of your data; however, a skewness coefficient computed using MS Excel provides a more
accurate and robust description of skewness. When we use MS Excel to compute the skewness
coefficient for this dataset, we obtain a value of – 2.42, which is less than – 0.5, hence the
distribution is highly negatively skewed or left-skewed. When you look at the dataset, you will
notice that the extreme value of 4 lies to the left of the distribution (ranked data), with most
values typically high in magnitude. The time shoppers spend in a store is an example of a
left-skewed distribution.

(b) Right-Skewed Distribution


A right-skewed distribution, also known as a positively skewed distribution, possesses the
following characteristics: (a) you will observe a long tail to the right of the distribution,
stretching towards the positive direction or right side of the graph, (b) the extreme values lie
to the right of the distribution, (c) the bulk of the data are clustered at the low end of the
distribution, (d) the mean is more than the median, (e) the bulk of the data are below the mean,
and (f) the skewness coefficient computed in MS Excel is more than + 0.5. The following dataset
(2, 4, 4, 7, 10, 12, 80) has a mean of 17, which is more than the median of 7, an indication of a
right-skewed distribution. Mathematically speaking, the mean minus the median yields a positive
result (17 – 7 = 10), which explains why a right-skewed distribution is also known as a
positively skewed distribution. When we use MS Excel to compute the skewness coefficient for
this dataset, we obtain a value of + 2.56, which is more than + 0.5, hence the distribution is
highly positively skewed or right-skewed. When you look at the dataset, you will notice that the
extreme value of 80 lies to the right of the distribution (ranked data), with most values
typically low in magnitude. The response time to emergency calls is an example of a right-skewed
distribution.

(c) Normal Distribution


A normal distribution, also known as a zero-skewed distribution, possesses the following
characteristics: (a) you will observe tails of similar length to both the left and right of the
distribution, stretching towards the negative and positive directions of the graph, (b) extreme
values lie both to the left and right of the distribution, (c) the bulk of the data are clustered
around the middle of the distribution, (d) the mean is equal to the median, (e) most of the
values are close to the mean, and (f) the skewness coefficient computed in MS Excel is between
– 0.5 and + 0.5. The following dataset (1, 22, 24, 25, 26, 28, 49) has a mean of 25, which is
equal to the median of 25, an indication of a normal distribution. Mathematically speaking, the
mean minus the median yields a zero result (25 – 25 = 0), which explains why a normal
distribution is also known as a zero-skewed distribution. When we use MS Excel to compute the
skewness coefficient for this dataset, we obtain a value of 0, which lies between – 0.5 and
+ 0.5, hence the distribution is zero-skewed, symmetrical or normal. When you look at the
dataset, you will notice that the extreme values of 1 and 49 lie at the left and right of the
distribution (ranked data), with most values typically around the average. The distribution of
students' grades is an example of a normal distribution (or bell curve).
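The skewness coefficients quoted in this section can be checked without Excel. The sketch below implements the adjusted Fisher-Pearson sample skewness formula, n / ((n – 1)(n – 2)) x Σ((x – mean) / s)³, which is the formula behind MS Excel's SKEW function:

```python
import statistics

def skew(data):
    """Adjusted Fisher-Pearson sample skewness (the formula behind Excel's SKEW)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

left   = [4, 80, 82, 88, 88, 90, 100]   # left-skewed example
right  = [2, 4, 4, 7, 10, 12, 80]       # right-skewed example
normal = [1, 22, 24, 25, 26, 28, 49]    # symmetrical example

print(round(skew(left), 2))         # -2.42
print(round(skew(right), 2))        # 2.56
print(abs(round(skew(normal), 2)))  # 0.0
```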

Kurtosis
Kurtosis is another descriptive method that can be used to describe the shape of a data
distribution. It measures the tailedness and peakedness of a distribution relative to a normal
distribution. It describes the degree to which values are clustered in the tails and peak of a
distribution, although it lays more emphasis on the tails than on the peak. Kurtosis determines
whether the tails of a data distribution match those of the normal distribution, that is, how
heavy or light the tails are. By quantifying the tailedness of a distribution, kurtosis explores
the presence and frequency of outliers, looking at how many extreme observations there are in
the dataset and whether the dataset has more or fewer extreme values than normal. The modern
definition shows that kurtosis is influenced more by extreme values (tails) than by values in
the center (peak) of the distribution; a recent interpretation looks at how much of the
variation in the data is due to extreme values. We will use MS Excel to compute what is known as
the excess kurtosis coefficient to describe three forms of kurtosis: platykurtic, leptokurtic
and mesokurtic distributions.

(a) Platykurtic Distribution


A platykurtic distribution, also known as negative kurtosis, describes a distribution with less
variation than a normal distribution. The curve looks low-peaked at the center (fairly flat at
the top) with thinner tails at the extreme ends. When compared to a normal distribution, a
platykurtic distribution has fewer extreme values in the tails (less tailed) and fewer values
around the center (less peaked). Because it has thinner tails and a low peak, there are less
data in the tails, more data in the shoulders and less data around the center. Less data in the
tails means a platykurtic distribution has fewer outliers than normal, an indication of less
risk compared to a normal distribution. A platykurtic distribution also portrays that small
changes are more common but large changes are less likely, due to the thinner tails. When
computed with MS Excel, a platykurtic distribution has an excess kurtosis coefficient of less
than – 0.5, with coefficient values between – 0.5 and – 1.0 described as moderate negative
kurtosis, and coefficient values less than – 1.0 described as high negative kurtosis. The
dataset (12, 24, 19, 28, 15, 30) has an excess kurtosis coefficient of – 1.85, which is less
than – 0.5, hence the distribution can be described as highly platykurtic, or high negative
kurtosis.

(b) Leptokurtic Distribution


A leptokurtic distribution, also known as positive kurtosis, describes a distribution with more
variation than a normal distribution. The curve looks highly peaked at the center with heavy or
fat tails at the extreme ends. When compared to a normal distribution, a leptokurtic
distribution has more extreme values in the tails (more tailed) and more values around the
center (highly peaked). Because it has fat tails and a high peak, there are more data in the
tails, less data in the shoulders and more data around the center. More data in the tails means
a leptokurtic distribution has more outliers than normal, an indication of more risk compared to
a normal distribution. A leptokurtic distribution also portrays that large changes are more
common but small changes are less likely, due to the fat tails. When computed with MS Excel, a
leptokurtic distribution has an excess kurtosis coefficient of more than + 0.5, with coefficient
values between + 0.5 and + 1.0 described as moderate positive kurtosis, and coefficient values
more than + 1.0 described as high positive kurtosis. The dataset (55, 24, 19, 28, 10, 2) has an
excess kurtosis coefficient of + 1.68, which is more than + 0.5, hence the distribution can be
described as highly leptokurtic, or high positive kurtosis.

(c) Mesokurtic Distribution


A mesokurtic distribution, also known as zero kurtosis, describes a distribution with similar
characteristics to a normal distribution. It has the same amounts of variation, peakedness and
tailedness as a normal distribution. A zero-kurtosis distribution has the same amount of values
in the tails as a normal distribution, as well as the same degree of clustering around the mean.
It shows a similar amount of data in the tails, in the shoulders and around the center as a
normal curve. It has the same amount of outliers as is expected of a normal distribution, which
is also an indication of similar risk. When computed with MS Excel, a mesokurtic distribution
has an excess kurtosis coefficient between – 0.5 and + 0.5, with a coefficient value of 0
described as perfect kurtosis. The dataset (21, 16, 18, 22, 10, 15) has an excess kurtosis
coefficient of + 0.02, which lies between – 0.5 and + 0.5, hence the distribution can be
described as mesokurtic, or zero kurtosis.
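Similarly, the excess kurtosis coefficients quoted above can be verified in Python. The sketch below implements the sample excess kurtosis formula behind MS Excel's KURT function, n(n + 1) / ((n – 1)(n – 2)(n – 3)) x Σ((x – mean) / s)⁴ minus 3(n – 1)² / ((n – 2)(n – 3)):

```python
import statistics

def kurt(data):
    """Sample excess kurtosis (the formula behind Excel's KURT)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

platykurtic = [12, 24, 19, 28, 15, 30]  # flatter than normal
leptokurtic = [55, 24, 19, 28, 10, 2]   # heavier-tailed than normal
mesokurtic  = [21, 16, 18, 22, 10, 15]  # close to normal

print(round(kurt(platykurtic), 2))  # -1.85
print(round(kurt(leptokurtic), 2))  # 1.68
print(round(kurt(mesokurtic), 2))   # 0.02
```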
The larger the kurtosis coefficient, the more extreme values there are, the more peaked the
distribution around the mean, the more variation there is, and the more risk associated with the
distribution. Conversely, the smaller the kurtosis coefficient, the less extreme values there are,
the less peaked the distribution around the mean, the less variation there is, and the less risk
associated with the distribution. The measure of kurtosis can be used to describe the spread of
COVID-19 infections. Without protective measures (e.g. social distancing and masks) to slow the
spread of infection, the rate of infection will peak quickly and sharply, which will overwhelm
the capacity of the health care system and expose people to a greater risk of infection and
potential death (a leptokurtic distribution). On the other hand, if protective measures are
implemented to slow the spread, the rate of infection will peak at a lower level and flatten out
at a slower pace that the health care system can support (a platykurtic distribution).

Readings
Readings from course textbook: Chapter 3.1
