
Comparisons of Statistics

Professor Harvey A. Singer


School of Management
George Mason University

I Inferences about the Data Distribution.


1 Skewness vs. Symmetry.

Skewness refers to the lack of symmetry of the distribution of data about its center.

1.1 A simple skewness factor.

The quantity

\[
\frac{\bar{x} - \hat{x}}{\bar{x}}
\]

is a non-dimensional measure of the lack of symmetry, or skewness, where x̄ is the
sample mean and x̂ is the sample median. It compares the offset between the sample
mean and median to the sample mean. This skewness factor carries an algebraic sign,
which indicates the direction of the skewing; its magnitude indicates the severity of
the skewing. A skewness factor of zero indicates no skewing: the mean and median
coincide.

When the median is less than the mean, x̄ > x̂ and the median lies to the left of the
mean on the number line. The numerator is then positive, and so is the skewness
factor. In this case the distribution is positively skewed: the "hump" is on the left
and the long tail is on the right, so the distribution is right-skewed. The
interpretation is that most of the data is less in value than the sample mean; more
of the sample data lie below (to the left of) the mean than above it (to the right).

When the median is greater than the mean, x̄ < x̂ and the median lies to the right of
the mean on the number line. The numerator is then negative, and so is the skewness
factor. In this case the distribution is negatively skewed: the "hump" is on the right
and the long tail is on the left, so the distribution is left-skewed. The
interpretation is that most of the data is greater in value than the sample mean;
more of the sample data lie above (to the right of) the mean than below it (to the
left).



The magnitude indicates the degree of the skewing: the larger the magnitude, the more
severely skewed and the more asymmetric the distribution.

In summary,

\[
\frac{\bar{x} - \hat{x}}{\bar{x}}
\begin{cases}
< 0 & \text{if left-skewed} \\
= 0 & \text{if symmetric (no skew)} \\
> 0 & \text{if right-skewed}
\end{cases}
\]

1.2 An alternative skewness factor.

An alternative non-dimensional measure of the skew, or lack of symmetry, of a
distribution is simply the ratio of the median to the mean, x̂/x̄. Here

\[
\frac{\hat{x}}{\bar{x}}
\begin{cases}
< 1 & \text{if right-skewed} \\
= 1 & \text{if symmetric (no skew)} \\
> 1 & \text{if left-skewed}
\end{cases}
\]

1.3 Pearson skewness factor.

Yet another dimensionless measure of skewness is the Pearson skewness factor, defined
as

\[
\frac{3(\bar{x} - \hat{x})}{s}
\]

where s is the sample standard deviation. The same rules and interpretations apply as
for the simple skewness factor in section 1.1 above.
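
As an illustrative sketch (not part of the original notes), all three skewness
factors are straightforward to compute in Python with NumPy; the sample data below is
made up purely for demonstration.

    import numpy as np

    # Hypothetical sample data, chosen to be right-skewed.
    data = np.array([2, 3, 3, 4, 5, 5, 6, 7, 9, 14], dtype=float)

    mean = data.mean()
    median = np.median(data)
    s = data.std(ddof=1)                 # sample standard deviation

    simple = (mean - median) / mean      # section 1.1: > 0 indicates right-skewed
    ratio = median / mean                # section 1.2: < 1 indicates right-skewed
    pearson = 3 * (mean - median) / s    # section 1.3: same sign rules as section 1.1

    print(f"simple = {simple:.3f}, ratio = {ratio:.3f}, Pearson = {pearson:.3f}")

All three factors agree on the direction of the skew; for this sample they signal
right-skewing, consistent with its long right tail.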



2 Clustering.

2.1 A simple clustering factor.

A simple dimensionless measure of the clustering of the data over its range is

\[
\frac{\mathrm{IQR}}{R}
\]

where the range R = x_max − x_min is the difference between the largest data value
x_max and the smallest data value x_min, and the interquartile range IQR = Q3 − Q1 is
the difference between the third and first quartiles, Q3 and Q1, respectively. This
measure looks at the spread of the inner half, or 50%, of all the sample data
relative to the full range of the data distribution. If the inner 50% of the data is
spread over less than half the full range, then the distribution must be tightly
clustered about its center; in this case the distribution has a fairly narrow central
"hump." If the inner half of the data takes up more than half the full range, then
the distribution is broadly scattered about its center; in this case the distribution
has a fairly wide central "hump." Thus

\[
\frac{\mathrm{IQR}}{R}
\begin{cases}
< 0.5 & \text{if tightly clustered about the center} \\
= 0.5 & \text{if evenly spread} \\
> 0.5 & \text{if broadly scattered about the center}
\end{cases}
\]
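
A minimal sketch of this clustering factor in Python (invented data; note that
np.percentile uses interpolated quartiles, so textbook quartile conventions may give
slightly different values):

    import numpy as np

    # Hypothetical sample data for illustration only.
    data = np.array([12, 15, 16, 17, 18, 18, 19, 20, 21, 30], dtype=float)

    q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
    iqr = q3 - q1                            # interquartile range
    r = data.max() - data.min()              # full range

    print(f"IQR/R = {iqr / r:.2f}")          # < 0.5 suggests tight central clustering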

2.2 Another clustering factor: Coefficient of Variation.

A dimensionless measure of clustering of sample data about the sample mean is the
coefficient of variation, CV, defined as the ratio of the sample standard deviation s
to the sample mean x̄, viz.

\[
CV = \frac{s}{\bar{x}}
\]

CV measures the relative variability of the sample data about its sample mean x̄.
That is, CV is a measure of variability, as expressed by s, relative to the center of
the distribution of the sample data, as expressed by the sample mean x̄. As such, CV
is a measure of the amount of clustering about x̄: the smaller the CV, the lower the
variability and the greater the degree of clustering about x̄.

3 Bell-Shaping of the Data Distribution: Normal Curve Shape.



3.1 Coincidence of central measures.

The distribution is bell-shaped if and only if the measures of center all coincide:

\[
\bar{x} = \hat{x} = x_{\text{midquartile}} = x_{\text{mode}}
\]
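
As a quick illustrative check (not part of the original notes; assumes SciPy ≥ 1.9
for stats.mode), these central measures can be compared on a synthetic bell-shaped
sample:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    data = np.round(rng.normal(loc=50, scale=5, size=1000))   # rounded so a mode exists

    q1, q3 = np.percentile(data, [25, 75])
    midquartile = (q1 + q3) / 2
    mode = stats.mode(data, keepdims=False).mode

    print(f"mean={data.mean():.1f}, median={np.median(data):.1f}, "
          f"midquartile={midquartile:.1f}, mode={mode:.0f}")

For bell-shaped data all four values should nearly coincide; for skewed data they
drift apart.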

3.2 Comparison to normal model: a clustering test.

Another dimensionless measure of the clustering of the distribution of the sample
data is the ratio of IQR to 1.35s. (For a normal distribution the IQR is
approximately 1.35 standard deviations, so this ratio compares the observed
clustering to that of a normal curve.) The ratio has the following properties:

\[
\frac{\mathrm{IQR}}{1.35\,s}
\begin{cases}
< 1 & \text{if more clustered and tighter than normal} \\
= 1 & \text{if bell-shaped} \\
> 1 & \text{if less clustered and broader than normal}
\end{cases}
\]

Then

\[
\frac{\mathrm{IQR}}{1.35\,s}
\begin{cases}
< 1 & \text{if narrower than normal} \\
= 1 & \text{if bell-shaped} \\
> 1 & \text{if wider than normal}
\end{cases}
\]
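
A minimal sketch of this test on synthetic data (the normal sample is generated here
only to show that the ratio lands near 1):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=100, scale=15, size=1000)   # synthetic normal sample

    q1, q3 = np.percentile(data, [25, 75])
    s = data.std(ddof=1)

    ratio = (q3 - q1) / (1.35 * s)
    print(f"IQR / 1.35s = {ratio:.2f}")   # close to 1 for normally distributed data

Running the same test on strongly skewed or heavy-tailed data pushes the ratio away
from 1.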

II Inferences about the Representativeness of x̄.

The sample mean x̄ is necessarily representative of the sample, because it is
calculated from all the sample data. The issue is how representative x̄ is of the
sample, i.e., how good x̄ is as the single value calculated to stand for, and in
place of, all the sample data. Hence the adequacy of x̄ as a single-value
representation of the sample needs to be measured.

1 The Coefficient of Variation, CV.



1.1 Definition and concept.

A dimensionless measure of the relative variability of sample data about the sample
mean is the coefficient of variation, CV, defined as the ratio of the sample standard
deviation s to the sample mean x̄, viz.

\[
CV = \frac{s}{\bar{x}}
\]

CV is a measure of variability, as expressed by s, relative to the center of the
distribution of the sample data, as expressed by the sample mean x̄. In so doing, CV
specifies the size of the standard deviation as a proportion of the mean. As such, CV
is a measure of the amount of clustering, and hence of the relative variability,
about x̄. The smaller the CV, the lower the relative variability and the greater the
degree of clustering about x̄.

Note: CV is often presented as a percent by multiplying the above decimal by 100.

1.2 Guidance for use.

(a) Self-diagnosis of x̄ on a sample.

For x̄ to be a good single-value representation of the entire sample, the relative
variability must be very small. As a result, CV must be very much smaller than unity,
viz.

\[
CV \ll 1
\]

As a rule of thumb, x̄ is a very good single-value representation of the entire
sample if

\[
CV < 0.001
\]

For larger and larger CV, x̄ is described as good, fair, poor, etc., and these terms
may themselves be qualified, e.g., "very poor," "somewhat good," etc. A CV of 0.1 or
higher generally indicates that x̄ is a very poor representation of the sample,
because the data is so widely scattered and consequently the sample has a high amount
of relative variability.

(b) Comparison of samples.

As a relative measure, CV is particularly useful when comparing the variability of
two or more samples or data sets that have different sample means, different sample
standard deviations, or perhaps are even expressed in different units of measurement.

For example, suppose stock A has averaged $50 per share over the past month with a
standard deviation of $10 per share, while stock B has averaged $12 per share with a
standard deviation of $4 per share. In actual dollars, looking at the standard
deviations, stock A seems to be more variable, that is, more volatile, than stock B.
However, for stock A the coefficient of variation is CV_A = $10/$50 = 0.20, while for
stock B it is CV_B = $4/$12 ≈ 0.33. Thus, relative to their respective means, the
price of stock B is much more variable than the price of stock A.
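
A minimal sketch of this comparison in Python, using the same numbers as the example
above:

    # Coefficient of variation: variability relative to the mean.
    mean_a, s_a = 50.0, 10.0   # stock A: mean price and standard deviation ($/share)
    mean_b, s_b = 12.0, 4.0    # stock B

    cv_a = s_a / mean_a
    cv_b = s_b / mean_b

    # Prints CV_A = 0.20, CV_B = 0.33: B is relatively more variable than A.
    print(f"CV_A = {cv_a:.2f}, CV_B = {cv_b:.2f}")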

1.3 Implications for estimation.

Whether the sample statistics are good representations of the corresponding true, but
a priori unknown, population parameters depends on how representative the sample
itself is of the population from which it was drawn. Specifically, if the sample is
not representative of the population, then the sample statistics cannot be used to
estimate the corresponding population parameters. If the sample is representative of
the population, then its statistics may be used as estimators of the corresponding
population parameters. (A way of ensuring representative samples is to select
population members at random, that is, free of any and all bias.) In particular, the
sample mean x̄ may be used as an estimate of the population mean µ.

It is a matter of chance whether any particular population member is sampled; that
chance is measured by the probability of being selected for the sample. Because the
sample mean depends on what was sampled, the sample and population means will hardly
ever exactly agree (except in very rare circumstances). As a result, there will
always be some difference, or "error," between x̄ and µ.

If x̄ is a good single-value representation of the entire sample, as measured by the
CV, then the sample mean will be a good estimator of the true, but unknown,
population mean, with a fairly small margin of error. Generally, the better x̄ is at
representing the entire sample, as measured by the CV, the smaller the margin of
error, and the better x̄ will be at estimating µ.

If x̄ is a poor representation of the entire sample, then the sample mean will be a
poor estimator of the population mean. Approximating or estimating the population
mean from the sample mean will entail a fairly large margin of error. This large
margin of error accounts not only for the accidents of sampling but also for the
"sloppiness" of x̄ at representing the sample.

1.4 Special uses.

The reciprocal of CV is used in finance to compare the relative risk of various
alternative investment options. This risk measure is referred to as a return-to-risk
ratio. The risk of a particular investment option is measured by the standard
deviation of the payoffs resulting from that option under all possible economic and
financial conditions. The return of an investment option is its expected monetary
value, EMV, that is, the payoff expected in the long run and on average over all
possible conditions. The higher the return-to-risk ratio, the higher the expected
return relative to the risk as expressed by the standard deviation. Therefore, the
higher the return-to-risk ratio, the more preferable the investment option.



In the example in paragraph 1.2(b) of this section, the return-to-risk ratios for
stocks A and B are 5.0 and 3.0, respectively. Therefore, relative to the risk as
measured by the standard deviation, the return per share is higher for stock A than
for stock B. Certainly A has a higher expected value than B, but it also has a higher
absolute risk. Even so, stock A is preferable to stock B.
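
The same comparison expressed as a short sketch (the ratio is simply the reciprocal
of the CV computed earlier):

    # Return-to-risk ratio: reciprocal of the coefficient of variation.
    mean_a, s_a = 50.0, 10.0   # stock A: mean price and standard deviation ($/share)
    mean_b, s_b = 12.0, 4.0    # stock B

    rtr_a = mean_a / s_a       # 5.0
    rtr_b = mean_b / s_b       # 3.0

    # Higher return-to-risk is preferable, so A is preferred over B.
    print(f"return-to-risk: A = {rtr_a:.1f}, B = {rtr_b:.1f}")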



III Inferences about Data Capture and Outliers
1 Outlier detection.

It is important to be able to identify anomalous values, or "outliers," in a sample
that may perturb results and skew inferences.

1.1 The midrange rule.

If

\[
x_{\text{midrange}} \approx \bar{x}
\]

where the midrange x_midrange = (x_max + x_min)/2, then there are no outliers in the
sample data. Outliers may be suspected if

\[
x_{\text{midrange}} \gg \bar{x}
\quad \text{or} \quad
x_{\text{midrange}} \ll \bar{x}
\]

That is, outliers may be suspected if the sample midrange is either much greater than
or much less than the sample mean. In the former case, x_max may be an outlier; in
the latter case, x_min may be an outlier. However, to objectively identify specific
sample values as outliers, the rule in the next section must be applied.
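
A minimal sketch of the midrange check (the data is invented, with one deliberately
extreme value):

    import numpy as np

    # Hypothetical sample with one suspiciously large value.
    data = np.array([4, 5, 5, 6, 6, 7, 7, 8, 25], dtype=float)

    midrange = (data.max() + data.min()) / 2
    mean = data.mean()

    # midrange = 14.5 vs mean ≈ 8.1: midrange >> mean, so x_max is suspect.
    print(f"midrange = {midrange:.1f}, mean = {mean:.1f}")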

1.2 The 3s rule.

As a rule, if all the sample data lie within ±3s of the sample mean, then there are
no outliers or extreme values in the sample data. That is, there are no outliers in
the sample if

\[
x_{\max} < \bar{x} + 3s
\quad \text{and} \quad
x_{\min} > \bar{x} - 3s
\]

where x_max and x_min are the largest and smallest data values in the sample.
Equivalently, for there to be no outliers in the sample, the following inequality
must be satisfied:

\[
\bar{x} - 3s < x_{\min} < x_{\max} < \bar{x} + 3s
\]

Therefore, outliers may be identified objectively if either of the following
conditions is satisfied:

\[
x_{\max} > \bar{x} + 3s
\quad \text{or} \quad
x_{\min} < \bar{x} - 3s
\]



The procedure to apply this rule is to first calculate both x̄ and s. Add 3s to x̄ to
find the upper cutoff for outliers, and subtract 3s from x̄ to find the lower cutoff.
Any sample data that lie beyond either of these two cutoffs are objectively
identified as outliers.
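
A minimal sketch of the 3s rule in Python (synthetic data: a tight cluster plus one
extreme value):

    import numpy as np

    rng = np.random.default_rng(1)
    data = np.append(rng.normal(loc=50, scale=2, size=30), 120.0)   # one extreme value

    mean = data.mean()
    s = data.std(ddof=1)          # sample standard deviation

    lower = mean - 3 * s          # lower cutoff for outliers
    upper = mean + 3 * s          # upper cutoff

    outliers = data[(data < lower) | (data > upper)]
    print(f"cutoffs: [{lower:.1f}, {upper:.1f}], outliers: {outliers}")

Note that an extreme value inflates s itself, so in very small samples a genuine
outlier can escape its own 3s fence; the rule works best on reasonably large samples.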

2 Data Capture.

2.1 Chebychev’s Rule.

In general, at least

\[
1 - \frac{1}{k^2}
\]

of all the sample data must lie within ±k standard deviations of the mean, where k is
any number greater than 1. For example, at least 0.75, or 75%, of the data must lie
within k = 2 standard deviations on either side of the mean.

It is important to verify that Chebychev’s Rule is satisfied by a data sample.
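
A minimal sketch of such a check (the skewed sample is synthetic; Chebychev's bound
holds regardless of the distribution's shape):

    import numpy as np

    rng = np.random.default_rng(2)
    data = rng.exponential(scale=5.0, size=500)   # skewed, non-normal sample

    mean = data.mean()
    s = data.std(ddof=1)

    for k in (2, 3):
        within = np.mean(np.abs(data - mean) <= k * s)   # observed fraction within ±k s
        bound = 1 - 1 / k**2                             # Chebychev lower bound
        print(f"k = {k}: observed {within:.2f} >= bound {bound:.2f}")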

2.2 Empirical Rule.

As a rule of thumb, for most data sets, roughly 2 out of 3 values, or more precisely 68%
of all the data, are contained within a distance of 1 standard deviation around the mean
(that is, on either side of the mean). About 95% of all data values are contained within a
distance of 2 standard deviations around the mean. Very nearly all the data will be within
3 standard deviations of the mean.

The empirical rule is strictly valid only for data that is normally distributed; it
does not hold for non-normal data. It is therefore useful to compare the data
captures observed in a sample to the empirical rule as a test of the sample for
normality.
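
A minimal sketch of that comparison on synthetic data (a normal sample should
reproduce the 68/95/99.7 captures; a skewed sample will not):

    import numpy as np

    rng = np.random.default_rng(3)
    data = rng.normal(loc=0.0, scale=1.0, size=2000)   # synthetic normal sample

    mean = data.mean()
    s = data.std(ddof=1)

    for k, expected in ((1, 0.68), (2, 0.95), (3, 0.997)):
        within = np.mean(np.abs(data - mean) <= k * s)
        print(f"within ±{k}s: {within:.3f} (empirical rule: ~{expected})")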

