Comparisons of Statistics
Skewness refers to the lack of symmetry of the distribution of data about its center.
The quantity

    (x̄ − x̂) / x̄

is a dimensionless measure of the lack of symmetry, or skewness, where x̄ is the
sample mean and x̂ is the sample median. It compares the offset between the sample
mean and median to the sample mean. This skewness factor carries an algebraic sign,
which indicates the direction of skewing, while its magnitude indicates the severity
of the skewing. A skewness factor of zero indicates no skewing: the mean and median
coincide.
When the median is less than the mean, x̄ > x̂, and the median lies to the left of
the mean on the number line. As a result, the numerator is positive and so too is the
skewness factor. In this case the distribution is positively skewed: the “hump” is on
the left and the long tail is on the right, so the distribution is right-skewed. The
interpretation is that most of the data is less in value than the sample mean; more of
the sample data lie below (to the left of) the mean than above it (to the right).
When the median is greater than the mean, x̄ < x̂, and the median lies to the right of
the mean on the number line. As a result, the numerator is negative and so too is the
skewness factor. In this case the distribution is negatively skewed: the “hump” is on
the right and the long tail is on the left, so the distribution is left-skewed. The
interpretation is that most of the data is greater in value than the sample mean; more
of the sample data lie above (to the right of) the mean than below it (to the left).
In summary,

    (x̄ − x̂)/x̄   < 0 if left-skewed
                 = 0 if symmetric (no skew)
                 > 0 if right-skewed

Equivalently, in terms of the ratio of the median to the mean,

    x̂/x̄   < 1 if right-skewed
           = 1 if symmetric (no skew)
           > 1 if left-skewed
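As a quick illustration, the simple skewness factor can be computed with Python's standard statistics module; the sample data below is hypothetical.

```python
import statistics

# Hypothetical right-skewed sample: the long tail is on the right.
data = [2, 3, 3, 4, 5, 9, 14]

mean = statistics.mean(data)      # x̄ = 40/7 ≈ 5.714
median = statistics.median(data)  # x̂ = 4

# Simple skewness factor (x̄ − x̂)/x̄; positive → right-skewed.
skew_factor = (mean - median) / mean
print(round(skew_factor, 3))      # 0.3
```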
Yet another dimensionless measure of skewness is the Pearson skewness factor, defined
as

    3(x̄ − x̂) / s

where s is the sample standard deviation. The same rules and interpretations apply as
for the simple skewness factor in section 1.1 above.
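A minimal sketch of the Pearson skewness factor on the same hypothetical sample, using statistics.stdev for the sample standard deviation s:

```python
import statistics

data = [2, 3, 3, 4, 5, 9, 14]    # hypothetical right-skewed sample

mean = statistics.mean(data)
median = statistics.median(data)
s = statistics.stdev(data)       # sample standard deviation s

# Pearson skewness factor 3(x̄ − x̂)/s; same sign rules as above.
pearson = 3 * (mean - median) / s
print(pearson > 0)               # True → right-skewed
```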
A simple dimensionless measure of the clustering of the data over its range is

    IQR / R

where the range R = xmax − xmin is the difference between the largest data value xmax
and the smallest data value xmin, and the interquartile range IQR = Q3 − Q1 is the
difference between the third and first quartiles, Q3 and Q1, respectively. This
measure looks at the spread of the inner half, or 50%, of the sample data over the
full range of the distribution. If the inner 50% of the data is spread over less than
half the full range, the distribution must be tightly clustered about its center and
has a fairly narrow central “hump.” If the inner half of the data takes up more than
half the full range, the distribution is broadly scattered about its center and has a
fairly wide central “hump.” Thus

    IQR/R   < 0.5 if tightly clustered about the center
            > 0.5 if broadly scattered about the center
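A sketch of the IQR/R measure on a hypothetical sample; note that statistics.quantiles uses its default "exclusive" method, which may differ slightly from other quartile conventions.

```python
import statistics

# Hypothetical sample with most values bunched near the center.
data = [1, 4, 5, 5, 6, 6, 7, 7, 8, 9, 15]

# Q1 and Q3 via statistics.quantiles (n=4 gives the three quartiles).
q1, _, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1                    # interquartile range IQR
r = max(data) - min(data)        # full range R

print(iqr / r < 0.5)             # True → inner 50% tightly clustered
```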
A dimensionless measure of clustering of sample data about the sample mean is the
coefficient of variation, CV, defined as the ratio of the sample standard deviation s
to the sample mean x̄, viz.

    CV = s / x̄

CV measures the relative variability of the sample data about the sample mean x̄.
That is, CV is a measure of variability, as expressed by s, relative to the center of
the distribution of the sample data, as expressed by x̄. As such, CV is a measure of
the amount of clustering about x̄: the smaller CV is, the lower the relative
variability and the greater the degree of clustering about x̄.
(For a perfectly symmetric, unimodal distribution, x̄ = x̂ = xmidquartile = xmode.)
Another dimensionless measure of the clustering of the distribution of the sample data
is the ratio of IQR to 1.35s. The ratio has the following properties:

    IQR/(1.35s)   < 1 if narrower than normal
                  = 1 if bell-shaped (normal)
                  > 1 if wider than normal
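This shape check can be sketched on simulated data; for a true normal distribution IQR ≈ 1.35s, so the ratio should land near 1.

```python
import random
import statistics

# Simulated bell-shaped (normal) data.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(10_000)]

q1, _, q3 = statistics.quantiles(data, n=4)
s = statistics.stdev(data)

ratio = (q3 - q1) / (1.35 * s)
print(0.9 < ratio < 1.1)         # True for bell-shaped data
```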
As noted above, the coefficient of variation CV = s/x̄ is a dimensionless measure of
the relative variability of the sample data about the sample mean x̄.
For x̄ to be a good single-value representation of the entire sample, the relative
variability must be very small. As a result, CV must be much smaller than unity, viz.

    CV ≪ 1

As a rule of thumb, x̄ is a very good single-value representation of the entire sample
if CV < 0.001. For larger and larger CV, x̄ is described as good, fair, poor, etc.,
and these terms may themselves be qualified, e.g., “very poor,” “somewhat good,” etc.
A CV of 0.1 or higher generally indicates that x̄ is a very poor representation of the
sample, because the data is so widely scattered that the sample has a high amount of
relative variability.
For example, suppose stock A has averaged $50 per share over the past month with a
standard deviation of $10/share, while stock B has averaged $12/share with a standard
deviation of $4/share. In actual dollars, looking only at the standard deviations,
stock A seems to be the more variable stock. Relative to price, however, CV for
stock A is 10/50 = 0.20 while CV for stock B is 4/12 ≈ 0.33, so stock B actually has
the higher relative variability.
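The stock comparison can be worked out directly; CV puts the two standard deviations on a common relative footing (figures from the example above).

```python
cv_a = 10 / 50   # stock A: s = $10/share, mean = $50/share
cv_b = 4 / 12    # stock B: s = $4/share,  mean = $12/share

print(round(cv_a, 2), round(cv_b, 2))   # 0.2 0.33
# Stock A varies more in absolute dollars, but stock B has the higher
# relative variability about its mean price.
```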
Insofar as the sample is itself representative of the population from which it was
drawn, the sample statistics may or may not be good representations of the
corresponding true, but a priori unknown, population parameters. Specifically, if the
sample is not representative of the population, then the sample statistics cannot be
used to estimate the corresponding population parameters; if the sample is
representative, then its statistics may be used as estimators of those parameters.
(A way of ensuring representative samples is to select population members at random,
that is, free of any and all bias.) In particular, the sample mean x̄ may be used as
an estimate of the population mean µ.
Whether any particular population member is sampled is a matter of chance, measured by
its probability of being selected for the sample. Because the sample mean depends on
exactly which members were sampled, the sample and population means will hardly ever
agree exactly (except in very rare circumstances). As a result, there will always be
some difference, or “error,” between x̄ and µ.
If x̄ is a poor representation of the entire sample, then the sample mean will be a
poor estimator of the population mean. Approximating or estimating the population mean
from the sample mean will then entail a fairly large margin of error, one that
accounts not only for the accidents of sampling but also for the “sloppiness” of x̄ at
representing the sample.
The reciprocal of CV is used in finance to compare the relative risk of various alternative
investment options. This risk measure is referred to as a return-to-risk ratio. The risk of
a particular investment option is measured by the standard deviation of the payoffs
resulting from that option under all possible economic/financial conditions. The return of
an investment option is the expected monetary value, EMV, of that option, that is, the
payoff expected in the long run and on average over all possible conditions. The higher
the return-to-risk ratio, the higher the expected return relative to the risk as expressed by
the standard deviation. Therefore, the higher the return-to-risk ratio, the more preferable
is the investment option.
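A sketch of the return-to-risk ratio for one investment option; the payoffs and condition probabilities below are assumed for illustration only.

```python
payoffs = [100, 50, -20]   # payoff under each economic condition ($)
probs = [0.3, 0.5, 0.2]    # probability of each condition

emv = sum(p * x for p, x in zip(probs, payoffs))   # expected return (EMV)
var = sum(p * (x - emv) ** 2 for p, x in zip(probs, payoffs))
risk = var ** 0.5          # standard deviation of the payoffs

print(round(emv / risk, 2))   # higher ratio → more preferable option
```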
If

    xmidrange ≈ x̄

then there are no outliers in the sample data. Outliers may be suspected if

    xmidrange ≫ x̄   or   xmidrange ≪ x̄

That is, outliers may be suspected if the sample midrange is either much greater than
or much less than the sample mean. In the former case, xmax may be an outlier; in the
latter, xmin may be an outlier. However, to objectively identify specific sample
values as outliers, the rule in the next section must be applied.
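The midrange-vs-mean screen can be sketched as below. The sample is hypothetical, and the 20% threshold used for "much greater" is an assumed illustration, not a standard cutoff.

```python
import statistics

data = [4, 5, 5, 6, 6, 7, 50]            # 50 looks suspiciously large

midrange = (max(data) + min(data)) / 2   # xmidrange = 27.0
mean = statistics.mean(data)             # x̄ ≈ 11.86

suspect_high = midrange > 1.2 * mean     # midrange ≫ mean → xmax suspect
print(suspect_high)                      # True
```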
As a rule, if all the sample data lies within ±3s of the sample mean, then there are
no outliers or extreme values in the sample data. That is, there are no outliers in
the sample if

    x̄ − 3s ≤ xmin   and   xmax ≤ x̄ + 3s

where xmax and xmin are the largest and smallest data values in the sample.
Conversely, outliers may be identified objectively as values satisfying

    xmax > x̄ + 3s   or   xmin < x̄ − 3s
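The ±3s rule applied directly to a hypothetical sample; note that in very small samples a single extreme value inflates s so much that it may escape the 3s fence, which is why a larger sample is used here.

```python
import statistics

data = [9, 10, 11] * 9 + [10, 100]       # 29 values, one extreme

mean = statistics.mean(data)
s = statistics.stdev(data)

# Flag any value outside (x̄ − 3s, x̄ + 3s).
outliers = [x for x in data if x > mean + 3 * s or x < mean - 3 * s]
print(outliers)                          # [100]
```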
2 Data Capture.
In general, by Chebyshev's theorem, at least

    1 − 1/k²

of all the sample data must lie within ±k standard deviations of the mean, where k is
any number greater than 1. For example, at least 1 − 1/2² = 0.75, or 75%, of the data
must lie within k = 2 standard deviations on either side of the mean.
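The capture bound is simple enough to express directly:

```python
# Chebyshev-style capture bound: at least 1 − 1/k² of any data set lies
# within ±k standard deviations of the mean, for k > 1.
def capture_bound(k: float) -> float:
    return 1 - 1 / k**2

print(capture_bound(2))            # 0.75 → at least 75% within ±2s
print(round(capture_bound(3), 3))  # 0.889 → at least ~89% within ±3s
```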
As a rule of thumb (the empirical rule), for most data sets roughly 2 out of 3 values,
or more precisely 68% of all the data, are contained within a distance of 1 standard
deviation of the mean (that is, on either side of the mean). About 95% of all data
values are contained within a distance of 2 standard deviations of the mean, and very
nearly all (about 99.7%) of the data will be within 3 standard deviations of the mean.
The empirical rule is strictly valid only for data that is normally distributed; it
does not hold for non-normal data. Comparing the actual data captures of a sample to
the empirical rule therefore provides a rough test of the sample for normality.
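A sketch of this normality screen on simulated normal data; the observed captures should land near 68%, 95%, and 99.7% for k = 1, 2, 3.

```python
import random
import statistics

random.seed(1)
data = [random.gauss(0, 1) for _ in range(10_000)]

mean = statistics.mean(data)
s = statistics.stdev(data)

# Fraction of the data captured within ±k standard deviations.
captures = {}
for k in (1, 2, 3):
    captures[k] = sum(abs(x - mean) <= k * s for x in data) / len(data)
    print(k, round(captures[k], 3))
```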