
[ST-INF] Statistical inference

Miguel-Angel Canela

Department of Managerial Decision Analysis

Contents

1. Sampling
2. Parameter estimation
3. Testing means and variances
4. Simple linear regression
5. More on testing

1. Sampling

1.1. Foreword
In this second part of the course, we turn to data analysis and, more specifically, statistical infer-
ence. It follows loosely the DeGroot & Schervish (2002) textbook, changing the order of appearance
only in special cases, as for simulation, but reducing the scope of the theoretical discussion here
and there, especially in estimation and testing. It differs from other introductory courses in that
multiple regression is not covered and analysis of variance is restricted to one-way ANOVA. Full
treatment of these topics is left for the econometrics course.
Computation is expected to be done in Stata. We indicate with special typeface Stata code, such
as in regress. Since the students of this course are expected to perform their own statistical analyses,
they may consider using texts linked to Stata that contain moderate doses of each topic covered.
I recommend Hamilton (2009), a popular lightweight.
These notes are complemented with data sets and scripts. The scripts contain the commands
used in the examples. The data sets are in Stata format (extension .dta). They can be opened
by double-clicking or from Stata with the use command. All the material for the course can be
downloaded from http://blog.iese.edu/mcanela/mrm.

1.2. Distributional diagnostic plots


In most of the studies we may be interested in, the data analyzed are real data, obtained by
sampling from real populations, but, in certain cases, they are simulated, that is, generated by the
computer after specifying a probability distribution.
In a research context, data analysis can have two approaches. Whereas exploratory analysis helps
to choose a model that may be reasonably adequate to the actual data, confirmatory analysis is
concerned with testing a particular model. In the former, we can guess what models may be used
for our data, while the latter is concerned with significance, evaluated through the P-values (section 3). Somewhere in the middle, distributional diagnostic plots may help in both approaches.
Distribution plots are used to check that a distributional assumption is reasonable for a particular
data set. In research papers, they are rarely reported, although they are frequently used.



Figure 1.1. Distribution of the strike duration (Example 1.1)

A histogram is a (vertical) bar diagram in which the bars are based on intervals of values of the
variable whose distribution is examined. The height of the bars is proportional to their frequencies.
The scale of the vertical axis can be set in terms of frequencies (counts), proportions or densities. In the Stata command histogram, the density scale is the default option. The upper sides of the rectangles
of a histogram can be seen as an approximation to the density curve. The histogram can be thus
compared to the density of the candidate model. This is the theory, but, in practice, what we see in
a histogram depends on the choice of the intervals, especially in small samples. I would recommend beginners to start with no more than 5 to 8 intervals whose extremes are round numbers.
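As a minimal sketch (the file and variable names duration and kennan.dta are assumptions of this note, not the course's actual names), a histogram like that of Figure 1.1 could be produced along these lines:

    use kennan.dta, clear                       // hypothetical file name
    histogram duration, frequency width(25)     // counts on the vertical axis, bins of width 25
    histogram duration, fraction                // same plot with proportions instead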

Example 1.1. The strike duration data given by Kennan (1985) are frequently used to illustrate
duration data modelling in econometrics courses. They give the duration, in days, of 62 strikes
that commenced in June, from 1968 through 1976, each involving at least 1000 workers and beginning at the expiration or reopening of a contract. The histogram (Figure 1.1) looks like that of an exponential distribution.

J Kennan (1985), The duration of contract strikes in US manufacturing, Journal of Econometrics 34, 5-28.

Quantile-quantile (QQ) plots are scatterplots, in which the two axes correspond to quantiles of a
distribution. This presentation is restricted to a special QQ plot for the normal distribution, the
normal probability plot (command qnorm in Stata). It matches an empirical distribution, not to
an individual distribution, but to the whole normal distribution model.
The normal probability plot is based on the fact that there is a linear relationship between a normal
variable and the N(0, 1) distribution. Suppose that we have a sample of independent univariate observations x_1, x_2, \ldots, x_n. We put the order statistic x_{(i)} on one axis and the N(0, 1) quantile z_i = \Phi^{-1}\big(i/(n+1)\big) on the other axis. Then, if the data were extracted from a normal distribution, the n points in the normal probability plot would be close to a straight line.
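The plot itself is a one-liner with qnorm; the manual construction below is only a sketch to make the definition explicit (x is an assumed variable name):

    qnorm x                               // normal probability plot of x
    // manual version: order statistics against N(0,1) quantiles
    sort x
    generate i = _n
    generate z = invnormal(i/(_N + 1))    // z_i, the standard normal quantile
    scatter x z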

Example 1.2. The data set for this example contains the daily returns of the Brazil and Mexico
MSCI indexes. It has been extracted from the DataStream database and covers the whole year
2003, with a total of 261 observations (no data on weekends). Returns are derived from the index values as follows. If x_t is the value of a particular index at day t, the daily return at this day is given by r_t = x_t/x_{t-1} - 1. The returns used here come in a percentage scale.
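In Stata, with the observations sorted by date, the percentage returns can be generated with a lagged expression of this kind (index is an assumed variable name):

    generate r = 100*(index/index[_n-1] - 1)    // r_t = x_t/x_(t-1) - 1, in percentage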



Figure 1.2. Histogram and normal probability plot (Example 1.2)

We use only the Brazil data. The estimates for the mean and the standard deviation (see paragraphs 1.4 and 1.5) are \bar{x} = 0.273 and s = 1.456. You can see a histogram of the Brazil returns in the left panel of Figure 1.2. The distribution is not really skewed, but the tails seem to be fatter than those of the normal distribution. The right panel of Figure 1.2 is a normal probability plot, including a straight line. The line has been chosen so that it passes through the first and third quartiles (others fit a regression line). You may find in this graphic the traits already identified in the histogram. These traits could be predicted from the sample skewness and kurtosis (paragraph 1.5).
This is one example of what in finance is called fat tails, a special pattern of departure from
the normal distribution. Since the normality of the returns was taken for granted in the classical
portfolio theory, the persistent evidence of fat tails found in financial returns data has been
discussed many times. Nowadays, the normality assumption has already been dropped.

1.3. A primer on simulation


The term sample is used in statistics with various meanings, depending on the context:
A (statistical) sample of a given distribution is a set X1 , . . . , Xn of independent random
variables, all with that distribution. n is the sample size. If we pick a value of each X_i, we get a sequence x_1, \ldots, x_n of values, which is also called a sample. To simulate a distribution
is to produce such a sequence of numbers. This paragraph is devoted to simulation.
Given a (real) population, a sample is a subset of the population. In most cases, samples
are assumed to have been extracted randomly, following a procedure in which all samples of that size have the same probability of being extracted. Frequently, this assumption is unrealistic. Biased samples are those extracted according to a procedure that would lead, on average, to an error in the estimates derived from the samples. This will be clearer in the next chapter. In econometrics, when the units of a sample are all extracted at the same time, so that the sample can be taken as a picture of the population at that time, the sample is called a cross section.
Simulation is very useful when learning statistics, since it helps to understand the models by looking
at the results that could be expected when observing variables for which these models are valid.
It is also useful in research, when searching for the properties of the distribution of a variable for



Figure 1.3. Scatterplots of simulated bivariate normal samples, with \rho = 0.75 and \rho = 0.25

which we do not have a simple formula. This is seen through practice.
The only thing that computers really simulate is the uniform distribution in the unit interval.
For instance, 0.5841526, 0.2326198, 0.6901792, 0.8181496 and 0.0532115 are five random numbers,
generated with the function runiform() of Stata (called uniform() before Stata 10.1). The rest
of the distributions simulated are obtained from this one by means of diverse transformations, which can be invented by the user or be available in your software of choice.

Since they are not really uniformly distributed, but only approximately, some call pseudorandom
what we call here random numbers. The distinction is irrelevant at the level of this course.
Stata has commands for generating samples from many special distributions (see the manual). The standard normal can be simulated with the Stata function rnormal(), whose syntax is similar to that of runiform(). With rnormal(m,s), we can specify the mean and the standard deviation (not the variance). The drawnorm command (this is a command, not a function) can be used for sampling from a multivariate normal distribution. There are various ways to specify the distribution, through the options of drawnorm. We have used drawnorm to produce the simulated samples of Figure 1.3, which correspond to bivariate normal distributions with standard marginals, with \rho = 0.75 and \rho = 0.25, respectively.
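A minimal sketch of these commands (the seed and the variable names are arbitrary choices of this note, not part of the course material):

    clear
    set seed 20140901
    set obs 500
    generate u = runiform()             // uniform on (0,1)
    generate x = rnormal(0, 1)          // standard normal
    // bivariate normal with standard marginals and correlation 0.75
    matrix C = (1, 0.75 \ 0.75, 1)
    drawnorm y1 y2, corr(C)
    scatter y1 y2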

1.4. Sample mean


In mathematical statistics, we use common means but take them as random variables. Take a sample X_1, \ldots, X_n of a probability distribution with expectation \mu. Then,

\bar{X} = \frac{X_1 + \cdots + X_n}{n}

is a random variable, called the sample mean. We expect \bar{X} to provide an approximation of the (population) mean \mu, more reliable as n increases. With the probability language, we can deal with this issue in a rigorous way. First,

E\big[\bar{X}\big] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \mu.

This means that, on average, the sample mean is right. Since the variance is a measure of the variation about the expectation, we next look at

var\big[\bar{X}\big] = \frac{1}{n^2}\sum_{i=1}^{n} var[X_i] = \frac{\sigma^2}{n}.

Note that the independence of the observations has been used here but not in the preceding
argument. From the expression of the variance, we see that it tends to zero as n \to \infty. This means that the variation of \bar{X} becomes irrelevant for big samples. We say that \bar{X} converges to \mu as n \to \infty. This statement, called the law of large numbers, is one of the great theorems of
mathematical statistics.
Although the idea of the law of large numbers should be clear enough to you, a comment on limit
theorems is worthwhile here. Much effort, when writing statistics textbooks, is put on distinctions
among the different types of convergence and on the proofs of limit theorems. Why is this? The
definition of the limit of a sequence of numbers has nothing to hide, because numbers are simple
things, but a random variable carries a lot on its back. What converges, the values of the variables,
the densities, parameters like means and variances . . . ? The fact is that there are different types of
convergence, and developing them here would take more space than allowed. So, the discussion here is quite short. The law of large numbers is easily proved if we formulate it in terms of convergence in probability. It can be stated as: for every number \epsilon > 0,

\lim_{n \to \infty} p\big[\,|\bar{X} - \mu| > \epsilon\,\big] = 0.

This is expressed, in short, as plim \bar{X} = \mu. The proof is based on the Chebyshev inequality,

p\big[\,|X - \mu| > \epsilon\,\big] \le \frac{\sigma^2}{\epsilon^2},

which is valid for any probability distribution with moments of first and second order, and not
hard to prove. We will come back to limit theorems in the next chapter.
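The law of large numbers is also easy to visualize with a running mean of simulated uniform draws, whose expectation is 0.5; the sketch below is only an illustration (seed and names are arbitrary):

    clear
    set seed 1
    set obs 10000
    generate x = runiform()
    generate n = _n
    generate xbar = sum(x)/n      // running sample mean of the first n draws
    line xbar n                   // settles down near 0.5 as n grows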

1.5. Sample covariance


Let us first refresh the formulas of the covariance and the variance used in data analysis, e.g. in simple linear regression,

s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.

Except for the detail that the denominator is n - 1 instead of n (this will be justified in section 2), these two measures are particular cases of covariances, and therefore inherit the properties of the covariance. Some properties of these formulas can also be derived from the analogy between the covariance and the product of two vectors. For instance, in the same way that we normalize a vector by dividing it by its modulus, we can do it with zero-mean variables, dividing by the standard deviation. This is used to standardize the columns of a data set, producing z-scores. Another example of this analogy: since the product of two unit vectors is equal to the cosine of their angle, the covariance of two unit-variance columns is a cosine, i.e. a number between -1 and 1. This statistical cosine gives the formula of the correlation that we use in data analysis. For more than two variables, we can calculate covariance and correlation matrices.
Replacing the numbers x_i, y_i by statistical samples X_i, Y_i, we get the definitions of the sample covariance,

S_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}),



Figure 1.4. Distribution of the mean of two (left) and five (right) throws of a die

the sample variance,

S_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2,

and the sample correlation,

R_{XY} = \frac{S_{XY}}{S_X S_Y} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\big[\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2\big]^{1/2}}.

Higher order sample moments, such as the skewness and kurtosis, are defined in a similar way. All these statistics are random variables, which we will use to estimate the parameters of the distribution
sampled (section 2).

Example 1.2 (continuation). In Example 1.2, the sample mean vector and covariance matrix are

\bar{x} = \begin{pmatrix} 0.273 \\ 0.105 \end{pmatrix}, \qquad S = \begin{pmatrix} 2.120 & 0.543 \\ 0.543 & 0.934 \end{pmatrix}.

The normality of the Brazil returns was explored graphically in paragraph 1.2. The sample skewness and kurtosis are Sk = 0.193 and K = 0.842, respectively, the latter suggesting that the normality assumption be discarded.
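These quantities can be reproduced with standard commands; a sketch, where brazil and mexico are assumed names for the two return series:

    correlate brazil mexico, covariance                       // sample covariance matrix S
    correlate brazil mexico                                   // correlation matrix
    tabstat brazil, statistics(mean sd skewness kurtosis)     // note: Stata reports raw kurtosis (3 for a normal)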

1.6. The central limit theorem


The central limit theorem is another grand name of mathematical statistics. It guarantees that, for big samples, the normal distribution gives a good approximation to the distribution of many interesting sample statistics, such as means or variances, which are proportional to a sum of independent and equally distributed terms. There are several versions of the theorem, which differ in the amount of assumptions made. Of course, the fewer assumptions you make, the longer the proof. Since we are not concerned with the technicalities, I state the theorem in a loose and general form, that may be applied to many particular situations. You may find in many textbooks proofs of some particular cases, which have preceded the general theorem in the history of probability. For instance, the application of the central limit theorem to the binomial setting is an older result, the Laplace-De Moivre formula, found in many elementary textbooks.



Suppose that X_1, X_2, \ldots, X_n, \ldots is a sequence of independent equally distributed variables, with finite mean and variance, \mu and \sigma^2, and denote by \bar{X}_n the mean

\bar{X}_n = \frac{X_1 + \cdots + X_n}{n}.

According to the central limit theorem, the CDF of (\bar{X}_n - \mu)/(\sigma/\sqrt{n}) converges to the standard normal CDF as n \to \infty. This means, in practice, that we can use the normal as an approximation to calculate probabilities related to \bar{X}_n. An equivalent version of the theorem uses the sum instead of the mean, replacing \mu and \sigma^2 by n\mu and n\sigma^2, respectively. Comments about a particular distribution being asymptotically normal usually refer to the approximation of this distribution by a normal. When such approximations are available, it is customary to specify whether the true distribution or the approximation is used. The terms exact and asymptotic are commonly used to this purpose.

The convergence granted by this theorem is not the same as that of the law of large numbers.
Here, we do not say that the variables converge to a limit, but that the distributions do.

One of the great things about the central limit theorem is that there is no restriction on the type of distribution, as long as it has moments of first and second order. In particular, it can be used for the continuous approximation to a discrete distribution. To see how this works, look at the plots of Figure 1.4. For the outcome of a regular die, the distribution is uniform, with probability 1/6. But the distribution of the mean of two dice (left) is no longer uniform, but triangular. What happens when we increase the number of dice averaged? The number of possible values increases, so that a continuous approximation makes sense, and the PDF gets closer to the bell shape of the standard normal (right).
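A quick simulation conveys the same idea; the sketch below averages five simulated dice and plots the histogram (all names and the seed are arbitrary):

    clear
    set seed 1
    set obs 10000
    generate m = 0
    forvalues k = 1/5 {
        replace m = m + ceil(6*runiform())    // one die throw
    }
    replace m = m/5                           // mean of five throws
    histogram m, discrete fraction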
By virtue of the central limit theorem, the normal can be seen as the limit of many distributions.
For instance, a B(n, \pi) variable is the sum of n independent Bernoulli variables. Then,

B(n, \pi) \approx N\big(n\pi, n\pi(1-\pi)\big).

Also, a Poisson variable with \lambda = n can be seen as the sum of n independent Poisson variables with \lambda = 1. Then,

P(n) \approx N(n, n).

Figure 1.5 illustrates these approximations. On the left, we see the B(100, 0.1) probabilities and the N(10, 9) approximation. On the right, the P(50) probabilities and the N(50, 50) approximation. It is worth remarking, with respect to the approximation of the binomial, that the quality of the approximation depends not only on n, but also on \pi, improving as \pi gets closer to 0.5.

Finally, the central limit theorem also applies to the \chi^2 distributions, so that, for a high n, the \chi^2(n) distribution is asymptotically normal,

\chi^2(n) \approx N(n, 2n).
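The exact and asymptotic probabilities can be compared directly in Stata; for instance, for the upper tail of the B(100, 0.1) of Figure 1.5 (the cut-off 15 is an arbitrary choice of this sketch):

    display binomialtail(100, 15, 0.1)    // exact P[X >= 15]
    display 1 - normal((15 - 10)/3)       // N(10, 9) approximation (closer with a continuity correction)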

1.7. Homework
A. Explore the normality of the Mexico returns in Example 1.2.

B. Simulating a multivariate uniform distribution is easy. For instance, from two samples (same
length) of pseudorandom numbers we simulate the uniform distribution on the unit square [0, 1] \times [0, 1]. After discarding the points where x + y > 1, we get a sample of the uniform distribution of Example 3.13. Use this approach to generate a sample of size 200 and plot
it to convince yourself.



Figure 1.5. Normal approximation for the B(100, 0.1) (left) and P(50) (right)

C. Draw a random sample of size 1000 from a trivariate normal distribution, with

\mu = \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 1 & 0.2 & 0.8 \\ 0.2 & 1 & 0.6 \\ 0.8 & 0.6 & 1 \end{pmatrix}.

When you get your random sample:


(a) Check the means and variances, draw the histograms and the probability plots, and compute skewnesses and kurtoses, to convince yourself that the marginals are the expected normal distributions.
(b) Calculate the covariance and the correlation matrices.

D. Draw ten random samples of length 100 of an exponential distribution and average them.
Compare a histogram of one of the samples with a histogram of the mean.

E. Draw a random sample of size 250 of a P(25) distribution. Plot a histogram and a normal
probability plot. How normal does this look?

2. Parameter estimation

2.1. Statistical inference


Roughly speaking, statistical inference is the process of drawing conclusions about a probability
model, based on a data set. The model can be formulated as a probability distribution or as a set
of regression equations relating some (dependent) variables to other (independent) variables. As
you will see in the econometrics course, in many cases and, in particular, in the regression line of
section 4, the model is just a conditional probability model. The conclusions are usually related
to the values of some parameters of the model.
More specifically, inference is concerned with one of the following tasks:
Estimation. In many statistical analyses, we assume that the probability distribution that
generated the data is known, except for the values of one or more parameters. I denote here the parameter by \theta, accepting that \theta can be multidimensional (I use boldface in that case).



The range of acceptable values of the parameters is called the parameter space. Example: for a normal distribution, the two-dimensional parameter is \theta = (\mu, \sigma^2), and the parameter space is \mathbb{R} \times (0, +\infty).
Testing. In hypothesis testing, we are concerned with the values of some unknown parameters of a prespecified model. We set a formal hypothesis about these parameters, such as \mu_1 = \mu_2 or \theta = 0, and the analysis leads to the acceptance or rejection of this hypothesis.
The hypothesis tested is related to some theoretical hypothesis, usually in such a way that
the rejection of the statistical hypothesis supports the theoretical hypothesis.
Prediction. Another form of inference deals with the prediction of random variables not yet
observed. For instance, we can model the service times for customers with an exponential
distribution, wishing to predict the service time for the next customer. In certain applica-
tions, such as modeling equity prices with time series models in finance, prediction is the
key issue, and the models themselves may be irrelevant.
Decision. In certain contexts, after the data have been analyzed, we must choose within a
class of decisions with the property that the consequences of each decision depend on the
unknown value of some parameter. For instance, health authorities may decide about giving
the green light to a new drug, based on the results of a clinical trial. Decision theory is not
covered in this course.
Experimental design. In the experimental sciences, the researcher develops, before the data
collection, a detailed plan in which the values of some independent variables are specified.
Such a plan is called an experimental design. Guidelines for developing experimental designs
are usually included in courses for experimental researchers (including psychology and mar-
ket research). We skip this here, since you are expected to deal with observational data, in
which such designs are not feasible. Experimental research and the subsequent data anal-
ysis are called conjoint analysis in market research and policy capturing in organizational
research.
The rest of this course is concerned with estimation and testing. Whereas this section sets the
framework for the discussion of estimation, sections 3, 4 and 5 are focused on testing.

2.2. Sampling distributions


Let X1 , . . . , Xn be a sample from a distribution. A function G = g(X1 , . . . , Xn ) is called a statistic
(singular). A statistic may be used as an estimator of an unknown parameter of the distribution.
We distinguish between the estimator and its individual values, called estimates. Since there may
be many potential estimators for a parameter, eg the mean and the median for in the N (, 2 )
distribution, we want to use the estimators with better properties. Thus, textbooks discuss the
desirable properties that potential estimators may have. For instance, for g linear, we have a linear
estimator. Linear estimators are usually preferred. The distribution of an estimator is called the
sampling distribution.
If \theta is a parameter, both estimators and estimates of \theta can be denoted by \hat\theta. This is practical in a theoretical discussion, or when there is no consensus on the estimator. For instance, if \mu denotes the mean of a distribution, the usual estimator of \mu is the sample mean, although the sample median can also be used (Example 2.1). For the sample mean, we write \bar{X} instead of \hat\mu.
The sample mean is the preferred estimator of the mean, since its sampling distribution has many desirable properties. I have already presented the formulas for the expectation and the variance of the sample mean in section 1. For a normal distribution, we can say more: when sampling from a N(\mu, \sigma^2) distribution, \bar{X} \sim N(\mu, \sigma^2/n). This is simple to understand, since n independent observations from a normal distribution form an n-variate normally distributed random vector, and any linear combination of the components produces a normal variable. If the distribution is not normal, besides the special properties that the sample mean may have for particular models, the central limit theorem ensures the asymptotic normality of the sample mean, meaning that, for big samples, the above distribution can be used as an approximation to the distribution of the



Figure 2.1. Distribution of the sample mean and median

sample mean.
The sample mean is an example of a moment estimator. More generally, we can replace, in the formula of the moment, the expectation operator E by an average of the corresponding powers of the observations. For instance, the statistic

\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2
is the moment estimator of the variance. Unfortunately, the expectation of this statistic does not coincide with \sigma^2. Indeed, since all the terms of the sum on the right have the same expectation (we assume here \mu = 0, to shorten the equations),

E\big[\hat\sigma^2\big] = E\big[(X_1 - \bar{X})^2\big] = E\big[X_1^2\big] + E\big[\bar{X}^2\big] - 2E\big[X_1 \bar{X}\big]
= \sigma^2 + \frac{\sigma^2}{n} - \frac{2}{n}\sum_{i=1}^{n} E[X_1 X_i] = \sigma^2 + \frac{\sigma^2}{n} - \frac{2\sigma^2}{n} = \frac{(n-1)\,\sigma^2}{n}.

This explains why the statistic

S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2

is preferred as an estimator of \sigma^2. S^2 is called (in most places) the sample variance. The formula for var[S^2] is more complex, involving the kurtosis. Nevertheless, under normality, the distribution of the sample variance can be related to the \chi^2 model (the proof involves some matrix algebra). More specifically,

\frac{(n-1)\,S^2}{\sigma^2} \sim \chi^2(n-1).

In particular,

var\big[S^2\big] = \frac{2\sigma^4}{n-1}.
The examination of the joint distribution of the mean and sample variance shows that, again under
a normality assumption, these estimators are independent. Assuming \mu = 0 to make it shorter, we



Figure 2.2. Distribution of the sample variance

have

E\big[(X_i - \bar{X})\,\bar{X}\big] = E\big[X_i \bar{X}\big] - E\big[\bar{X}^2\big] = \frac{E[X_i^2]}{n} - \frac{\sigma^2}{n} = 0,

meaning that \bar{X} and X_i - \bar{X} are uncorrelated. Under normality, this implies that they are independent, and so are \bar{X} and S^2.

Unfortunately, there is no complete agreement on the denominator in the formula of the variance
estimator, and some authors use n in the denominator.

Example 2.1. Sampling distributions are easily understood through simulation. In this example,
n = 10 and the distribution sampled is N (0, 1). We generate 1000 samples, saving the means,
medians and variances in a data set that contains 1,000 observations of each of these three statistics.
The histograms of the sample mean and the sample median (the sample median of 10 observations
is the midpoint between the fifth and the sixth observations) are shown in Figure 2.1. The means
are 0.0039 and 0.0012, respectively, close to the zero population mean. The standard deviation of the sample mean is 0.3112, close to the theoretical value 10^{-1/2} = 0.3162. The standard deviation of the sample median is 0.3652, a bit higher. This agrees with the theory (paragraph
2.3), and supports the preference for the mean.
The left panel of Figure 2.2 is a histogram of the sample variance, consistent with the expected
\chi^2 profile. The mean is 1.0021 and the standard deviation 0.4732. Note that, according to the theory, the sample variance is one ninth of an observation of a \chi^2(9). The correlation between the
sample mean and variance is 0.0167, also in agreement with the theory. This is illustrated in the
right panel of the figure.
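A sketch of this kind of simulation, with each row of the data set holding one sample of size 10 (names and seed are arbitrary):

    clear
    set seed 2014
    set obs 1000
    forvalues j = 1/10 {
        generate x`j' = rnormal()
    }
    egen m10 = rowmean(x1-x10)         // sample mean
    egen med10 = rowmedian(x1-x10)     // sample median
    egen s10 = rowsd(x1-x10)
    generate v10 = s10^2               // sample variance
    summarize m10 med10 v10
    correlate m10 v10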

2.3. Properties of an estimator


This section is devoted to a brief description of the properties that make an estimator adequate,
restricting the detail to the estimation of a single (unidimensional) parameter. We are interested
in the properties related to the sampling distribution: mean, variance, normality etc. We start
with the bias. The bias of an estimator \hat\theta of an unknown parameter \theta is the mean deviation with respect to the true value of the parameter,

B\big[\hat\theta\big] = E\big[\hat\theta\big] - \theta.



Taking the deviation \hat\theta - \theta as the error of our estimate, the bias has a direct interpretation as a mean error. An unbiased estimator is one for which the bias is null. For instance, from a previous discussion, we know that the sample mean is unbiased, and, after correcting the denominator, the sample variance is also unbiased. Moment estimators of skewness and kurtosis are sometimes corrected in a similar way (not in Stata).
The mean square error, defined as

MSE\big[\hat\theta\big] = E\big[(\hat\theta - \theta)^2\big],

also has a direct interpretation. It can be shown to have two components,

MSE\big[\hat\theta\big] = B\big[\hat\theta\big]^2 + var\big[\hat\theta\big].

The standard deviation of an estimator is usually called the standard error. We denote it by se[\hat\theta]. Among unbiased estimators, the one with the lower standard error is preferred. This is called efficiency. More explicitly, if \hat\theta_1 and \hat\theta_2 are unbiased estimators of \theta, we say that \hat\theta_1 is more efficient than \hat\theta_2 when se[\hat\theta_1] \le se[\hat\theta_2]. For instance, both the sample mean and the sample median can be used as estimators of \mu for an N(\mu, \sigma^2) distribution, but the mean is more efficient. We have found this in the simulation of Example 2.1, but a mathematical proof is more difficult. In many cases maximum efficiency is sought among linear estimators, leading to the concept of best linear unbiased estimators (BLUE). Best means here minimum variance.
Another approach to the assessment of an estimator is based on its convergence as the sample size
tends to infinity. In this line, a common requirement for an estimator is consistency. Let me use the notation \hat\theta_n, to emphasize the dependence of the estimator on the sample size n. We say that the sequence \hat\theta_n is consistent when plim \hat\theta_n = \theta. Based on the Chebyshev inequality (section 1), it can be shown that \lim se[\hat\theta_n] = 0 implies consistency. Thus, those estimators whose variance has an n in the denominator, as the sample mean and variance, are consistent.
Another desirable property is asymptotic normality. A sequence of estimators \hat\theta_n is asymptotically normal when the CDF of (\hat\theta_n - E[\hat\theta_n])/sd[\hat\theta_n] converges to the standard normal CDF as n \to \infty. This means, in practice, that certain estimators are taken, for big samples, as if they were normally distributed. For instance, owing to the central limit theorem, the sample mean is asymptotically normal. Also, the maximum likelihood estimation method, which we leave for the econometrics course, produces asymptotically normal estimators in many situations, making the inference from the estimates much simpler.
All these definitions can be extended to estimators of a multidimensional parameter without pain
(assuming that you are familiar with matrix and vector formulas). If we take a random vector \hat{\boldsymbol\theta} as an estimator of a parameter vector \boldsymbol\theta, the bias is a vector and the MSE a matrix, given by

MSE[\hat{\boldsymbol\theta}] = E\big[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^T\big] = B[\hat{\boldsymbol\theta}]\,B[\hat{\boldsymbol\theta}]^T + cov[\hat{\boldsymbol\theta}].

Also, the variance is replaced by the covariance matrix in the efficiency comparisons: \hat{\boldsymbol\theta}_1 is more efficient than \hat{\boldsymbol\theta}_2 when cov[\hat{\boldsymbol\theta}_2] - cov[\hat{\boldsymbol\theta}_1] is positive semidefinite. Although the definitions are so easily extended, handling efficiency becomes a bit involved. I leave this here.

2.4. The t distribution


The inference about the mean of a univariate normal distribution is based on the fact that, if X has a N(\mu, \sigma^2) distribution, then

Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).



Figure 2.3. Density curves N(0, 1) and t(5) (dashed line)

If we don't know \sigma (this is what happens in practice), we can replace \sigma by S, getting

T = \frac{\bar{X} - \mu}{S/\sqrt{n}}.

The distribution is no longer the standard normal, but a different distribution, called the Student t distribution. It is a symmetric distribution, with zero mean and a bell-shaped density curve, similar to the N(0, 1) density (Figure 2.3). Like the normal, the Student's t is not a single distribution, but a distributional model, in which an individual distribution is specified by a parameter, which, as in the \chi^2, is a positive integer called the number of degrees of freedom (df).
The formula for the Student's t(n) PDF (n is here the number of degrees of freedom) is

f(x) = \frac{\Gamma\big((n+1)/2\big)}{(n\pi)^{1/2}\,\Gamma(n/2)} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}.

As given here, this formula still makes sense when n is not an integer. Nonintegers can be used
in certain nonstandard tests.
For an alternative definition, take two independent variables X and Y; the Student t is produced as

X \sim N(0, 1), \quad Y \sim \chi^2(n) \;\Longrightarrow\; \frac{X}{\sqrt{Y/n}} \sim t(n).

Because of the symmetry with respect to zero, the mean and the skewness of a t distribution are null (the skewness converges only for n > 3). For n > 2, the variance is n/(n-2) (infinite for n \le 2), and the kurtosis is 6/(n-4) (n > 4). Since this is relevant for a low n, the Student's t is sometimes used in finance to replace the normal as a model for the distribution of the returns of a security index.
We denote by t_\alpha (by t_\alpha(n) if there is ambiguity) the critical values of the Student's t. This means that, if X has a t distribution, then p[X > t_\alpha] = \alpha. The Student's t converges (in distribution) to the standard normal as the number of degrees of freedom tends to infinity. This, in practice, means that, denoting by F_n the CDF of the t(n) distribution, we have, for z \in \mathbb{R} and 0 < \alpha < 1,

\lim_{n \to \infty} F_n(z) = \Phi(z), \qquad \lim_{n \to \infty} t_\alpha(n) = z_\alpha.



The Stata functions for the Student's t are similar to those for the standard normal: tden(n,t) gives the PDF, ttail(n,t) the area of one tail and invttail(n,q) the critical value. We illustrate this with some calculations:
For the standard normal, z_{0.025} = 1.96 gives a 95% interval. The corresponding t value, for 2 degrees of freedom, is t_{0.025}(2) = 4.3027, obtained with invttail(2,0.025).
Increasing the degrees of freedom, the critical value gets closer to 1.96, as predicted by the theory: t_{0.025}(5) = 2.5706, t_{0.025}(10) = 2.2281, t_{0.025}(25) = 2.0595, t_{0.025}(100) = 1.9840, etc. A similar process can be carried on starting with ttail(2,1.96) and increasing the degrees of freedom. This should lead to the limit 0.025 (approx.).
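These calculations are one-liners; a sketch:

    display invttail(2, 0.025)      // 4.3027
    display invttail(100, 0.025)    // 1.9840, approaching 1.96
    display ttail(2, 1.96)          // one-tail area, approaching 0.025 as the df grow
    display ttail(100, 1.96)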

2.5. Confidence limits for a mean


Although we are interested in confidence regions in general, we start with the discussion of a
particular case, that of the confidence limits of the mean. Once the ins and outs of this case are
understood, it is easy to extend the idea to the general setting. Suppose that X has a N(\mu, \sigma^2) distribution. In 95% of the cases, we get

\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}.

This formula gives limits for \mu, called the 95% confidence limits for the mean. If X is not normally distributed but n is high (in many cases n > 25 suffices), this formula gives an approximation which, in general, is taken as acceptable. Replacing 1.96 by an adequate critical value z_\alpha, we can switch from the 95% to our probability of choice. Thus, the formula

\bar{x} \pm z_\alpha\,\frac{\sigma}{\sqrt{n}}

gives the limits for a confidence level 1 - 2\alpha. If the confidence level is not specified, it is understood that it is 95% (\alpha = 0.025). With the confidence limits, we can compare the sample mean \bar{x} to a reference value \mu_0. If \mu_0 falls out of the limits, we conclude, with the corresponding confidence level, that \mu \ne \mu_0. We say then that the difference \bar{x} - \mu_0 is significant.
With real data, \sigma is unknown, but, for a big n, it can be replaced by s, obtaining an approximate formula for the confidence limits of the mean. Nevertheless, there is an exact formula, appropriate for a small n (the difference between both formulas becomes irrelevant for big samples), in which z_\alpha is replaced by t_\alpha(n-1). The formula is then

\bar{x} \pm t_\alpha(n-1)\,\frac{s}{\sqrt{n}}.

All this was said assuming a normal distribution. If this assumption is not valid, the formula of the
confidence limits is still approximately valid for big samples, by virtue of the central limit theorem.
In such a case, using either z or t does not matter, since they will be close. An application of these ideas is the formula of the confidence limits for a proportion, discussed in the next paragraph.

Example 2.2. Over the last decade, several large-scale cross-cultural studies have focused on well-
being in a wide range of nations and cultures, but, in general, Latin countries have only been
sporadically represented in these studies. In a recent study, the influence of gender, marital status
and country citizenship on different aspects of well-being has been examined, testing the uniformity
within the Latin world and comparing the variance due to the country effect with those due to
the gender and marital status effects. The data were collected on a sample of managers following
a part-time MBA program at business schools from nine Latin countries. I use here data on job
satisfaction (average of a 12-item Likert scale) from three countries, Chile (CH), Mexico (ME) and
Spain (SP).



Figure 2.4. Histogram and normal probability plot for the Chile group (Example 2.2)

After removing the cases with missing values (listwise), the sample size is n = 423, and the group
sizes n0 = 121, n1 = 111 and n2 = 191, for Chile, Mexico and Spain, respectively. The group
statistics are reported in Table 2.1. In Stata, these statistics are obtained with the command
tabstat.
TABLE 2.1. Group statistics (Example 2.2)

Statistic   Chile   Mexico   Spain   Total
Size        121     111      191     423
Mean        4.158   4.413    4.162   4.227
Stdev       0.902   0.865    0.814   0.858

We calculate next 95% confidence limits for the mean of the Chile subpopulation. Based on t_{0.025}(120) = 1.980, we get 4.158 \pm 0.162. Of course, for such a sample size, using the t or the N(0, 1) critical value does not matter. Also, we can leave aside the concern about the normality of the distribution, although the diagnostic plots of Figure 2.4 show that normality is questionable here. Note that, since we are dealing with the Chile mean, the point is whether we assume or not the normality of the distribution in the Chile subpopulation. This is not a trivial issue, because we don't make any assumption on the global three-country distribution. It is not clear what such a distribution would be, not only because the means may be different, but because we are sampling from the three countries in proportions that have no meaning in population terms.
In Stata, the limits can be obtained directly, using invttail(120,0.025), but there is a CI calculator that performs these calculations without bothering the user with critical values (command ci).
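A sketch of the computation (jobsat and country are assumed variable names, not those of the actual data set; the ci syntax shown is that of older Stata versions, recent versions use ci means):

    summarize jobsat if country == "CH"
    display r(mean) - invttail(r(N)-1, 0.025)*r(sd)/sqrt(r(N))    // lower limit
    display r(mean) + invttail(r(N)-1, 0.025)*r(sd)/sqrt(r(N))    // upper limit
    ci jobsat if country == "CH"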

Source: S Poelmans & MA Canela, Statistical analysis of the results of a nine-country study
of Latin managers, XIth European Congress on Work and Organizational Psychology (Lisboa,
2003).



2.6. Confidence limits for a proportion
Take a Bernoulli distribution with success probability \pi. In this case, the mean of n observations is just the proportion of successes, which I denote by p. Owing to the CLT, the sampling distribution of p approaches the normal for big samples. Since here \sigma^2 = \pi(1-\pi), we have, with a probability 1 - 2\alpha,

\pi - z_\alpha \sqrt{\frac{\pi(1-\pi)}{n}} < p < \pi + z_\alpha \sqrt{\frac{\pi(1-\pi)}{n}}.

In the formula of the confidence limits, since \pi is unknown, it is replaced by p, getting

p \pm z_\alpha \sqrt{\frac{p(1-p)}{n}}.

Survey designers call the summand on the right the sampling error (that is how they report the error). The sampling error is an assessment of the magnitude of the error that we can get in extrapolating to the population the estimate derived from a random sample (be careful here, randomness is not granted in many surveys). If nothing is said to the contrary, z = 1.96 is used.
Sometimes, the sampling error is calculated prior to the survey, when p is still unavailable. If an initial estimate (or guess) is available, it is used in the formula. If not, we use p = 0.5, which is the worst possible case, since the maximum value of p(1-p) is attained for p = 0.5 (check this!). In this case, rounding z_{0.025} = 1.96 to 2, the sampling error can be approximated by 1/\sqrt{n}. This provides the practitioners with a rule of thumb: for n = 100, the sampling error is 10%, for n = 400, it is 5%, etc. This explains the sizes used in the surveys that are currently reported in the media, in which going far beyond n = 1600 would not compensate the increase in the cost of the survey.

Example 2.3. A survey is planned on the consumption of soft drugs by boys/girls of ages from 15
to 20 years in a certain population. Assuming simple random sampling from a big population and
a percentage of consumption of about 20%, how big must the sample be for ensuring, with a 95%
confidence, that the error in the percentage estimated is lower than 5%?
The sample size n must be such that the sampling error corresponding to a 95% confidence level is less than 5%. Taking p = 0.20, this means that

1.96 \sqrt{\frac{0.2 \times 0.8}{n}} < 0.05 \;\Longrightarrow\; n > \frac{1.96^2 \times 0.2 \times 0.8}{0.05^2} = 245.86.

Therefore, the sample size must be at least 246. If the initial 20% estimate were not available, we would use 50%. This would lead us to

n > \frac{1.96^2 \times 0.5 \times 0.5}{0.05^2} = 384.15.
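The arithmetic can be checked directly in Stata:

    display 1.96^2*0.2*0.8/0.05^2    // 245.86, so n = 246
    display 1.96^2*0.25/0.05^2       // 384.15 with the conservative p = 0.5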

2.7. Homework
A. Suppose that a random variable X can take the values 1, 2, 3, 4, 5, with probabilities

f(1) = \theta^3, \quad f(2) = \theta^2(1-\theta), \quad f(3) = 2\theta(1-\theta), \quad f(4) = \theta(1-\theta)^2, \quad f(5) = (1-\theta)^3.

Here, the value of the parameter \theta (0 \le \theta \le 1) is unknown.

(a) Check that this defines a probability distribution for any \theta in the specified range.
(b) Given a constant c, consider an estimator \hat\theta = h(X) defined by

h(1) = 1, \quad h(2) = 2 - 2c, \quad h(3) = c, \quad h(4) = 1 - 2c, \quad h(5) = 0.

Show that \hat\theta is an unbiased estimator of \theta for any c.
(c) Find the value c(\theta) for which \hat\theta has minimum variance.



B. The standard deviation is usually estimated by the square root of the sample variance,

S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2},

called the sample standard deviation. It is a biased estimator. Positively or negatively?



C. The mean absolute deviation (MAD), defined as E\big[|X - \mu|\big], is used sometimes as a measure of dispersion. The sample version,

MAD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|,

is used then as an estimator.

(a) Prove that, in a normal distribution, E\big[|X - \mu|\big] = \sigma\sqrt{2/\pi}.
(b) Part (a) suggests using the estimator \hat\sigma = \sqrt{\pi/2}\; MAD. Generate 10000 independent samples of size 5 of the standard normal and calculate the corresponding estimates of \sigma = 1 based on the mean absolute deviation. Check that this estimator is more biased than the sample standard deviation, but the variance is similar.
D. The Student distribution with 2 degrees of freedom has density f(x) = (2 + x^2)^{-3/2}. Calculate p[0 < X < 1] with the trapezoidal rule, checking the result in Stata.
E. Give an asymptotic formula for the 95% confidence limits of the mean of a Poisson distri-
bution.

3. Testing means and variances

3.1. An example
The logic of hypothesis testing is not obvious for the beginner. Only with some experience in testing does one realize the efficiency that results from using the same argument in many different situations. Thus, I skip the theoretical discussion, although something else will be said in section
5. So, instead of a formal definition, we start with an example.

Example 3.1. The data set for this example includes data on wages in years 1980 and 1987. The
sample size is n = 545. The variables are: (a) NR, an identifier, (b) LWAGE0, wages in 1980, in
thousands of US dollars and (c) LWAGE7, the same for 1987. The wages come in log scale, which
helps to correct the skewness. Do these data support that there has been a change in the wages?
Note that we dont care about individual changes, but about the average change.
To examine this question, we introduce a new variable X, corresponding to the difference between
these two years (1987 minus 1980). We search for evidence that the mean of X is different from zero. Our first analysis is based on a confidence interval. The basic information is

\bar{x} = 0.473, \quad s = 0.606, \quad n = 545.

Given the sample size, we shouldn't worry about the skewness, but, anyway, the histograms of Figure 3.1 support the use of the log scale. With the appropriate t factor (t_{0.025}(544) = 1.964), we
get the 95% confidence limits 0.422 and 0.524. Since this interval does not contain zero, we can
conclude (95% confidence) that there has been a change. We say then that the mean difference
0.473 is significant or, more specifically, that it is significantly different from zero.

Source: JM Wooldridge (2000), Introductory Econometrics, South Western College Publishing.



Figure 3.1. Distribution of wage differences (Example 3.1)

3.2. The one-sample t test


More formally, let me denote by \mu the population mean of the wage increase (in log scale). Then, the conclusion of the above argument can be stated saying that we tested the null hypothesis H_0: \mu = 0. When applied to the actual data, the test rejects H_0. So, we conclude that \mu \ne 0, with 95% confidence. An alternative way to perform the test is based on the test statistic

T = \frac{\bar{X}}{S/\sqrt{n}},

which, assuming that H_0 is valid and, implicitly, that the distribution of X is normal, follows a t(n-1) distribution. We call this a t test. The absolute value of the statistic is compared with the critical value t_{0.025}, which corresponds to a 95% interval. If the critical value is exceeded, H_0 is rejected, with 95% confidence. We say then that the t value is significant.
Although the 95% level is a standard, we can change it by replacing t_{0.025} by the t critical value corresponding to the 1 - 2\alpha confidence level. Mind that the use of levels other than the usual 95% has to be justified. Here, the value

t = \frac{0.473}{0.606/\sqrt{545}} = 18.22

exceeds the critical value 1.964. Again, we reject H_0 and conclude that there is a change in mean wages.
The result of this test is usually presented in terms of a P-value, which is the exact 2-tail probability associated, under the null, to the actual value of the statistic on which the test is based (a t statistic in this example) or to a more extreme value. It is taken as a measure of the extent to which the actual results are significant (the lower P, the higher the significance). With a 95% confidence level, we consider that there is significance when P < 0.05. In the example,

P = p\big[\,|T| > 18.22\,\big] = 1.54 \times 10^{-58}.

This P-value is obtained, in Stata, with 2*ttail(544,18.22). By replacing \bar{x} by \bar{x} - \mu_0, this test can be applied to a null H_0: \mu = \mu_0, in which \mu_0 is a prespecified value. Finally, the statistic

Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}},



Figure 3.2. Distribution of job satisfaction in Chile and Mexico (Example 2.2)

which has a N(0, 1) distribution under the null, can be used if \sigma is known. Note that, in Example 3.1, replacing t(n-1) by N(0, 1) means changing the critical value 1.964 to 1.960. Without normality, the same methods are approximately valid for big samples. It is generally agreed that n \ge 50 is big enough for that.
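In Stata, the analysis of Example 3.1 reduces to a few commands; a sketch, where the difference variable d is created for this purpose (LWAGE0 and LWAGE7 are the variable names of the data set):

    generate d = LWAGE7 - LWAGE0
    summarize d
    ttest d == 0                    // one-sample t test of H0: mu = 0
    display 2*ttail(544, 18.22)     // two-tailed P-value by hand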

3.3. The two-sample t test


We consider now a null H_0: \mu_1 = \mu_2, where \mu_1 and \mu_2 are the means of two distributions. The two-sample t test applies to two independent samples of these distributions. In the simplest variant, it is assumed that the variance is the same for the two distributions compared (\sigma_1 = \sigma_2). If this assumption is not valid, we use a second variant, a bit more involved. Because of this extra complexity, textbooks frequently present a complete justification of the first variant, giving less detail about the second variant. Nevertheless, this complexity is irrelevant with a computer at hand, because you can run both variants in seconds. In practice, the P-values of the two tests are quite close, except (possibly) when the sample sizes are very different.
In this test, both distributions are assumed to be normal, but this can be relaxed for big samples, as in paragraph 3.2. Let me assume first equal variances, and take two independent samples, of sizes n_1 and n_2, from the distributions N(\mu_1, \sigma^2) and N(\mu_2, \sigma^2), with sample means \bar{X}_1 and \bar{X}_2, respectively.
Then \bar{X}_1 - \bar{X}_2 is normally distributed and, due to the properties of the mean and the variance seen in section 1, satisfies

E\big[\bar{X}_1 - \bar{X}_2\big] = \mu_1 - \mu_2, \qquad var\big[\bar{X}_1 - \bar{X}_2\big] = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right).

If \sigma is known (this seldom happens), we can use the statistic

Z = \frac{\bar{X}_1 - \bar{X}_2}{\sigma\sqrt{(1/n_1) + (1/n_2)}},

with P-values provided by the N(0, 1) distribution. If \sigma is unknown, we replace it by the pooled variance

S^2 = \frac{n_1 - 1}{n_1 + n_2 - 2}\,S_1^2 + \frac{n_2 - 1}{n_1 + n_2 - 2}\,S_2^2,






Figure 3.3. Correlation (Example 3.1)

which is an unbiased estimator of \sigma^2. We use then the t statistic

T = \frac{\bar{X}_1 - \bar{X}_2}{S\sqrt{(1/n_1) + (1/n_2)}},

whose P-values are given by the t(n_1 + n_2 - 2) distribution. In a second variant of the test, it is not assumed that \sigma_1 = \sigma_2 (nor that they are different). We use then

T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{(S_1^2/n_1) + (S_2^2/n_2)}}, \qquad df = \frac{\big(S_1^2/n_1 + S_2^2/n_2\big)^2}{\frac{(S_1^2/n_1)^2}{n_1 - 1} + \frac{(S_2^2/n_2)^2}{n_2 - 1}}.

Under the null, the distribution of T can be approximated by a Student t, with the degrees of freedom given by the second formula. This is called the Satterthwaite approximation. The number of degrees of freedom is rounded when the test is done by hand.

Example 2.2 (continuation). To illustrate the two-sample t test, I apply it to the mean difference between Chile and Mexico in Example 2.2. I use first the equal-variances version. We have

\bar{x}_1 = 4.158, \quad s_1 = 0.902, \quad n_1 = 121, \qquad \bar{x}_2 = 4.413, \quad s_2 = 0.865, \quad n_2 = 111.

Then,

s = \sqrt{\frac{120 \times 0.902^2 + 110 \times 0.865^2}{230}} = 0.884, \qquad t = \frac{4.158 - 4.413}{0.884\sqrt{1/121 + 1/111}} = -2.196.

Therefore, P = 0.029. This can be obtained, in Stata, with 2*ttail(230,2.196). The t statistic and the P-value are directly obtained with the ttest command, whose default is the equal-variances option. With the option unequal we get the alternative version, which gives t = -2.200 (df = 230, P = 0.029). As expected, the differences between the two versions of the test are irrelevant.
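A sketch of the same comparison in Stata (jobsat and country are assumed names, as before; the condition simply drops the Spain group):

    ttest jobsat if country != "SP", by(country)             // equal variances
    ttest jobsat if country != "SP", by(country) unequal     // Satterthwaite version
    display 2*ttail(230, 2.196)                               // P-value by hand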



3.4. The t test for paired data
The independence of the samples in the two-sample t test is not a trivial issue, since the distribution of the test statistic under the null can be far from a Student t if we relax this assumption. This typically happens in the so-called paired data. This expression is applied to a sample of two (potentially correlated) variables, related to the same phenomenon, like two measures at different times, or before and after a treatment is applied. The data set does not appear as two independent univariate samples, but as a bivariate sample. The remedy is easy: we calculate a difference for each two-dimensional observation and test the null \mu = 0 for the variable thus obtained. This is what we did in Example 3.1. Thus, the t test for paired data is nothing but a one-sample t test applied to the difference. In Stata, the command ttest has options unpaired and paired.

Example 3.1 (continuation). The analysis of section 3.1 would usually be run as a paired data t test. What if we do the wrong thing, applying a two-sample test? We get then t = 15.19 (df = 1088, P < 0.001). So, the two tests lead to the same conclusion. Is this true in general? The answer is no. In this case, in spite of the results being similar, the two-sample test can be easily shown to be wrong, since the two variables are positively correlated (r = 0.310), which makes sense, since most of the people with higher wages still get high wages seven years later. We will see in chapter 10 how to test the null \rho = 0. For the time being, we content ourselves with illustrating this question with Figure 3.3.
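In Stata, the paired and the (inappropriate) unpaired analyses can be compared directly; a sketch with the variable names of Example 3.1:

    ttest LWAGE7 == LWAGE0              // paired t test on the differences
    ttest LWAGE7 == LWAGE0, unpaired    // two-sample version, wrong here
    correlate LWAGE0 LWAGE7             // r = 0.310, the source of the trouble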

3.5. The F distribution

Some popular tests, such as those used in the analysis of variance and linear regression, are based on the ratio of two sample variances. They are based on a well known model for the distribution of these ratios. For two independent samples with distributions N(\mu_1, \sigma^2) and N(\mu_2, \sigma^2), the ratio of the sample variances,

F = \frac{S_1^2}{S_2^2},

follows an F distribution. The general formula for the PDF of an F distribution with (n_1, n_2) degrees of freedom, briefly F(n_1, n_2), is

f(x) = \frac{\Gamma\big((n_1 + n_2)/2\big)}{\Gamma(n_1/2)\,\Gamma(n_2/2)}\; \frac{n_1^{n_1/2}\, n_2^{n_2/2}\, x^{n_1/2 - 1}}{(n_2 + n_1 x)^{(n_1 + n_2)/2}}, \qquad x > 0.

An example is shown in Figure 3.4. The first factor is a normalization constant. n_1 and n_2 are the parameters of this model. When it is used as a model for a ratio of two sample variances, n_1 is associated to the numerator and n_2 to the denominator. We denote by F_\alpha the critical value associated to the right tail. More explicitly, p[F > F_\alpha] = \alpha. In Stata, calculations related to the F distribution can be managed through the functions Ftail, invFtail and upperFtail.
Together with the t, the \chi^2 and F distributions are presented in textbooks as the distributions derived from the normal. The F distribution can be related to the other two in two ways:
Assuming independence, X_1 \sim \chi^2(n_1), X_2 \sim \chi^2(n_2) \;\Longrightarrow\; \frac{X_1/n_1}{X_2/n_2} \sim F(n_1, n_2).
X \sim t(n) \;\Longrightarrow\; X^2 \sim F(1, n).
It is important to keep in mind this second property, which implies that any t test can be seen as an F test. The only thing lost in taking the squares is the sign. A consequence of this relationship is that

|t(n)| > c \;\Longleftrightarrow\; F(1, n) > c^2

or, equivalently, t_{\alpha/2}(n)^2 = F_\alpha(1, n).
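The relationship can be checked numerically; for instance, with n = 10 and \alpha = 0.05 (arbitrary choices of this sketch):

    display invttail(10, 0.025)^2    // squared two-sided t critical value
    display invFtail(1, 10, 0.05)    // F(1, 10) critical value, the same number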



Figure 3.4. Density curve F(4, 4)

3.6. The one-way ANOVA F test


In this section, we present the extension of the two-sample t test to k (independent) samples. This test, which applies to a null H_0: \mu_1 = \cdots = \mu_k, is associated to the analysis of variance (ANOVA). I restrict the actual presentation to the analysis of variance of one factor, briefly one-way ANOVA.
In paragraph 3.3, the data set was composed of two independent groups. As the square of a variable with a t(n-2) distribution has an F(1, n-2) distribution, I denote the squared t statistic by F. It is easy to check that

\frac{n_1 n_2}{n}\,(\bar{x}_1 - \bar{x}_2)^2 = n_1 (\bar{x}_1 - \bar{x})^2 + n_2 (\bar{x}_2 - \bar{x})^2

and, therefore, the squared t statistic can be written as

F = (n - 2)\, \frac{n_1 (\bar{x}_1 - \bar{x})^2 + n_2 (\bar{x}_2 - \bar{x})^2}{\sum_{j=1}^{n_1} (x_{1j} - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2}.

Note that the numerator of this fraction is a sum of squares related to between-group variation,
whereas the denominator is related to within-group variation. The number of summands is the
same, n. The advantage of this expression is that it can be easily generalized to the case of k
samples, leading to the one-way ANOVA F test. Suppose now k independent samples, of sizes
n1 , . . . , nk , and let n = n1 + + nk be the total sample size. The data can be arranged as in
Table 3.1, where each group is a column (the columns can have different lengths) and the last row
contains the group means.
The denominator of the fraction in the above formula is equal to the sum of squares within sample
1, plus the sum of squares within sample 2. To generalize this, we use the within-group sum of
squares
Xn1 nk
X
2 2
SSW = x1j x1 + + xkj xk = (n k)s2 .
j=1 j=1

The numerator can be regarded as a sum of squares, some of them repeated, so that the number of
summands is the same as in the denominator. The generalization to k groups is straightforward,



leading to the between-groups sum of squares,

SSB = n_1 (\bar{x}_1 - \bar{x})^2 + \cdots + n_k (\bar{x}_k - \bar{x})^2.

Presenting the data as in Table 3.1 helps to understand what these sums of squares are. We can
consider two different sources of variability in this table. Vertically, we see the variability within
the groups, measured by SSW. Horizontally, in the means of the last row, we see the variability
between the groups, measured by SSB.
TABLE 3.1. Data for a one-way ANOVA test

Group 1      Group 2      ...    Group k
x_{11}       x_{21}       ...    x_{k1}
x_{12}       x_{22}       ...    x_{k2}
...          ...          ...    ...
x_{1n_1}     x_{2n_2}     ...    x_{kn_k}
\bar{x}_1    \bar{x}_2    ...    \bar{x}_k

The general one-way ANOVA F statistic is defined as

F = \frac{n - k}{k - 1}\, \frac{SSB}{SSW} = \frac{n_1 (\bar{x}_1 - \bar{x})^2 + \cdots + n_k (\bar{x}_k - \bar{x})^2}{(k - 1)\, s^2}.

Note that, for k = 2, the factor k - 1 equals 1 and can be omitted, leading to the formula given above. Under the null, this F statistic has an F(k-1, n-k) distribution, which can be used to calculate a P-value.

Example 2.2 (continuation). In Example 2.2, the pooled variance is
s² = (120 × 0.902² + 110 × 0.865² + 190 × 0.814²) / 420 = 0.728,
and the F statistic,
F = [121 (4.158 − 4.227)² + 111 (4.413 − 4.227)² + 191 (4.162 − 4.227)²] / (2 × 0.728) = 3.58.
With (2, 420) degrees of freedom, this leads to P = 0.029 and, therefore, to the rejection of the
null. We can conclude, then, that there are differences between countries.
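Stata can be used as a calculator to check these figures. The lines below (not part of the original script material) reproduce the F statistic, up to the rounding of the group means, and its P-value with the Ftail function:
display (121*(4.158-4.227)^2 + 111*(4.413-4.227)^2 + 191*(4.162-4.227)^2) / (2*0.728)
display Ftail(2, 420, 3.58)
The first line returns a value close to 3.58 and the second one a P-value of about 0.029.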

3.7. The ANOVA table and the analysis of residuals


The ANOVA table (Table 3.2) is a classical presentation of the F test, which illustrates the steps
to be followed in order to obtain the F value. It is based on the decomposition SST = SSB + SSW,
which on the left side has the total sum of squares,
SST = Σi,j (xij − x̄)².

A number of degrees of freedom is assigned to each sum of squares. Roughly speaking, it is the
number of independent terms in the sum. Since x̄ is the mean of all the observations xij, SST has
n − 1 degrees of freedom. In SSB, there are k different terms, but the deviations x̄i − x̄ satisfy one
linear constraint (their sum, weighted by the group sizes, is zero), so that SSB has k − 1 degrees of
freedom. Finally, the contribution of the ith group to the within-group sum of squares has ni − 1
degrees of freedom, and, hence, the number of degrees of freedom for SSW is
(n1 − 1) + ··· + (nk − 1) = n − k. Thus, the n − 1 degrees of freedom are split, k − 1 going to SSB
and n − k to SSW.
TABLE 3.2. One-way ANOVA table

  Source            Sum of squares   Degrees of freedom   Mean square   F statistic   P-value
  Between-groups    SSB              k − 1                MSB           F             P
  Within-groups     SSW              n − k                MSW
  Total             SST              n − 1

Next, a mean square (MS) is calculated, dividing the sums of squares by their respective numbers
of degrees of freedom. The F statistic is the ratio

F = MSB / MSW.

Note that the within-group mean square MSW is equal to the pooled variance s². In one-way
ANOVA, the conditions for the validity of the F test are the same as in the two-sample t test. The
data set is partitioned into k groups, which are assumed to be independent samples of N(μi, σ²)
distributions, i = 1, . . . , k. Whether these assumptions are acceptable is usually checked through
the residuals. In one-way ANOVA, the residuals are the deviations with respect to the group
means, eij = xij − x̄i. If the one-way ANOVA assumptions were valid, the residuals should look
like a random sample of the N(0, σ²) distribution, and this is what we check in practice. The
assumption that the variance is the same for all samples is called homoskedasticity. This topic will
be discussed in depth in the econometrics course.
TABLE 3.3. ANOVA table (Example 2.2)

  Source            Sum of squares   Degrees of freedom   Mean square   F value   Significance level
  Between-groups    5.217            2                    2.609         3.58      0.029
  Within-groups     305.6            420                  0.728
  Total             310.8            422                  0.736

Example 2.2 (continuation). The ANOVA table corresponding to Example 2.2 (Table 3.3) can be
obtained in Stata with the command oneway, but it does not provide the residuals. The command
anova is much more powerful, since it allows for more complex forms of analysis of variance. Note
that, in anova, the variable that defines the groups must be coded as a numerical variable. If it is
not so in your data set, it can be easily changed with the command encode. Also, anova allows
postestimation commands.
A postestimation command is one that exploits the results of the last estimation command to
produce additional results, not included in the initial output. Probably, predict is the most
widely used of these commands. The option residuals produces here a new variable containing
the ANOVA residuals. You can see the histogram and the normal probability plot in Figure 3.5.
The skewness is 0.5. So far, the normality assumption is not clear at all, but we wouldn't worry
about the validity of the conclusion, since, with such sample sizes, the F test is safe enough.
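A minimal sketch of this workflow is given below. The variable names satisf (the response) and country (the group) are illustrative, not those of the course data set; if the group variable comes as text, encode country, generate(ncountry) creates the numeric version required by anova.
oneway satisf country
anova satisf country
predict res, residuals
histogram res
qnorm res
summarize res, detail
The last command reports, among other statistics, the skewness of the residuals.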

Figure 3.5. Distribution of the residuals (Example 2.2)

3.8. Homework
A. Reanalyze the data of Example 3.1 after putting the wages back in the original scale (no
logs). How does the interpretation of the mean difference change?
B. Draw 250 independent random samples of size 5 from N (0, 1) and calculate the sample
variance for each sample. The same for N (1, 1). Divide the first by the second, getting 250
F statistics and plot a histogram. Compare this histogram with Figure 3.4.
C. The data set for this exercise comes from the same study as Example 7.2, but includes data
from nine countries. Test the differences among countries using the methods of this chapter.

4. Simple linear regression

4.1. The regression line


This section is a refresher of a math classic: given a set of points (x1, y1), . . . , (xn, yn), we search
for the line y = b0 + b1 x that fits these points in the best way. Of course, this is not operational
unless we specify what we mean by fitting well. All methods agree on the same idea: a good
fit means that the residuals ei = yi − (b0 + b1 xi) are small. This course covers only the ordinary
least squares (OLS) method, in which the coefficients are found by minimizing the residual sum of
squares
SSR = Σ ei².

Taking SSR as a function S(b0, b1) of the coefficients, this is an unrestricted optimization problem
with a quadratic objective function, which is easy to solve through differential calculus. The
gradient vector and the Hessian matrix are, respectively,
\nabla S(b_0, b_1) = -2 \begin{bmatrix} n(\bar y - b_0 - b_1 \bar x) \\ \sum x_i y_i - n b_0 \bar x - b_1 \sum x_i^2 \end{bmatrix}, \qquad
\nabla^2 S(b_0, b_1) = 2 \begin{bmatrix} n & n \bar x \\ n \bar x & \sum x_i^2 \end{bmatrix}.
It is easily seen that
det ∇²S(b0, b1) = 4n Σ (xi − x̄)² > 0,



and, hence, that the objective function is convex. So, the equation ∇S(b0, b1) = 0 has at most
one solution, which will be the minimum sought. With a bit of algebra, the solution can be written
as
b1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,   b0 = ȳ − b1 x̄.

The line of equation y = b0 + b1 x is called the regression line (of y on x). b0 and b1 are the
regression coefficients: b1 is the slope and b0 the intercept or constant. It follows from the formula
of the intercept that ȳ = b0 + b1 x̄, meaning that the regression line crosses the average point (x̄, ȳ).
This is equivalent to the sum (and the mean) of the residuals being equal to zero. A third way of
saying the same thing is writing the equation of the regression line as y − ȳ = b1 (x − x̄).
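As a sanity check, the OLS formulas can be applied by hand and compared with the output of regress. The sketch below uses simulated data; everything in it (sample size, seed, true coefficients) is an arbitrary choice, not taken from the course material.
clear
set seed 123
set obs 100
generate x = rnormal()
generate y = 1 + 2*x + rnormal()
quietly correlate x y, covariance
matrix C = r(C)
scalar b1 = C[2,1]/C[1,1]
quietly summarize x
scalar mx = r(mean)
quietly summarize y
scalar b0 = r(mean) - b1*mx
display "b1 = " b1 "   b0 = " b0
regress y x
The slope and intercept displayed by hand coincide with the coefficients reported by regress.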

4.2. The R2 statistic


The residual sum of squares SSR is a measure of the quality of the fit. But, in practical statistical
analysis, it is difficult to interpret, since it depends on the units of y and on the number of points.
So, we define a standardized goodness-of-fit statistic as follows.
With a bit of algebra, we can work out, from the formula of the intercept, that the identity
yi − ȳ = b1 (xi − x̄) + ei is an orthogonal decomposition,
\begin{bmatrix} y_1 - \bar y \\ y_2 - \bar y \\ \vdots \\ y_n - \bar y \end{bmatrix}
= b_1 \begin{bmatrix} x_1 - \bar x \\ x_2 - \bar x \\ \vdots \\ x_n - \bar x \end{bmatrix}
+ \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}.
Then, Pythagoras' theorem gives us an ANOVA decomposition
Σi (yi − ȳ)² = b1² Σi (xi − x̄)² + Σi ei²,
which can be written, in short, as SST = SSE + SSR. Here, SSE is the sum of squares explained by
the regression. Dividing by SST, we rewrite the ANOVA decomposition as
R² = SSE/SST,   1 − R² = SSR/SST.

R² is called the R-squared statistic (also coefficient of determination). It is also referred to as the
percentage of the variation explained by the regression. To understand how to use this decomposi-
tion, look at the extreme cases. If R² = 0, then b1 = 0, the regression line is the horizontal line of
equation y = ȳ and the residuals are the deviations with respect to the mean, ei = yi − ȳ. There
is no fit at all. When R² = 1, all the residuals are null, meaning that the points (xi, yi) are aligned
and the fit is perfect. These two extremes are never found with real data and, in practice, we take
the proximity of R² to 1 as an indication of good fit.
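In Stata, the pieces of this decomposition are available after regress as stored results: e(mss) is the explained (model) sum of squares, e(rss) the residual sum of squares, and e(r2) the R-squared. A minimal check, continuing the simulated x and y of the sketch in paragraph 4.1 (an assumption; any regression would do):
quietly regress y x
display e(r2)
display e(mss)/(e(mss) + e(rss))
Both lines print the same number.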

4.3. Regression and correlation


The formulas related to the regression line become more compact if we introduce standard devia-
tions and correlations, which make sense if we take the set of points as a bivariate sample. Dividing
by n − 1 in the expression of the slope, we get
b1 = sxy / sx² = r · sy / sx.

Figure 4.1. Regression line (Example 4.1)

This tells us that the sample correlation is a standardized regression slope. If x and y have unit
variance, the slope is equal to the correlation. Also, with a bit of algebra, it can be proved that R²
is the square of the correlation, justifying the use of the letter R.
Some prefer to write the equation of the regression line as y − ȳ = b1 (x − x̄), avoiding the intercept.
Finally, writing the equation of the regression line as
(y − ȳ)/sy = r (x − x̄)/sx,

we see that, if both variables are standardized, the slope coincides with the correlation and the
intercept is zero. The former is no longer true in multiple regression, where standardized regression
coefficients are not correlations, although, in most cases, they look as if they were.
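This fact is easy to verify: standardize both variables and the fitted slope is the sample correlation. A minimal sketch, again continuing the simulated x and y used above (an assumption):
egen zx = std(x)
egen zy = std(y)
quietly correlate x y
display r(rho)
regress zy zx
The coefficient of zx in the regression output equals the correlation displayed just before it, and the constant is zero up to rounding.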

Example 4.1. The data for this example come from a study on the salaries of academic staff
at Bowling Green State University. This data set has been used in several textbooks and can be
considered as a standard example. The sample size is 514. We use here two variables: (a) SALARY,
academic year (9 month) salary in US dollars, and (b) MARKET, marketability of the discipline,
defined as the ratio of the national average salary paid in the discipline to the national average
across disciplines.
It is natural here to take MARKET as the independent variable, fitting a regression line to the
data to produce an equation for predicting the salary from the marketability. Denoting them as x
and y, respectively, we get

x̄ = 0.948,   sx = 0.149,   ȳ = 50,863.9,   sy = 12,672.7,   r = 0.407,
hence
b1 = 0.407 × 12,672.7 / 0.149 = 34,545.2,   b0 = 50,863.9 − 34,545.2 × 0.948 = 18,097.0.
We have thus obtained the equation of the line for the regression of SALARY on MARKET,
SALARY = 18,097.0 + 34,545.2 MARKET.

We can see in Figure 4.1 a scatterplot of these data, with the regression line superimposed.



Source: JS Balzer, N Boudreau, P Hutchinson, AM Ryan, T Thorsteinson, J Sullivan, R Yonker
& D Snavely (1996), Critical modeling principles when testing for gender equity in faculty salary,
Research in Higher Education 37, 633–658.

4.4. The simple linear regression model


In the preceding paragraphs, the discussion was pure math. Now, we turn to statistical analysis,
to see how we can exploit this in inferential terms. The inference relies, as usual, on distributional
assumptions. Although you may frequently hear that regression is safe when the variables are
normal, such a strong assumption is not really needed. It is true that if X and Y are jointly
normal, the conditions for inference are met, but the converse is not, and in most of the regression
analyses that you may find in research papers, some of the variables involved are discrete, for
instance dummy variables (see exercise C) introduced by the analyst to control for things like
gender, seniority or industrial sector.
The simple linear regression model can be presented in two ways:
As a conditional distribution assumption. We assume that, conditional on X, the distribution of
Y is normal, with
E[Y | X] = β0 + β1 X,   var[Y | X] = σ².
Here, β0, β1 and σ² are the parameters of the model, on which the inference is to be done.
As a regression equation. We assume that
Y = β0 + β1 X + ε,   ε ∼ N(0, σ²),   E[ε | X] = 0.
You can easily derive the first formulation from the second one by fixing X and taking expectation
and variance, and the second from the first one by putting ε = Y − E[Y | X]. But let me insist on
the main assumptions, since their failure provides much of the motivation for alternative methods
in the econometrics course:
Linearity. Although most analysts apply linear regression methods taking linearity for
granted, nonlinearity issues must be taken into account. In many cases, they are easily
fixed.
Uncorrelated error term. E[ε | X] = 0. It can be proven, using the properties of the
conditional expectation, that this implies that X and ε are uncorrelated, although the two
conditions are not equivalent. Nevertheless, under joint normality, this property, uncorrelatedness
and independence are all the same.
Homoskedasticity. The fact that the error variance is constant receives this picturesque
name. When this assumption fails, we say that there is heteroskedasticity. Methods
for dealing with heteroskedasticity will be introduced in the econometrics course.

Note that there are no assumptions on X in the linear regression model.

4.5. Estimation
It follows easily from the assumptions of the linear regression model that
β1 = cov[X, Y] / var[X],   β0 = E[Y] − β1 E[X].
We can obtain moment estimators for these parameters by replacing means, variances and covariance
by their sample versions,
b1 = sxy / sx²,   b0 = ȳ − b1 x̄.



But these are precisely the slope and the intercept of the regression line. Hence, the moment
estimators are given by the least squares formulas. We call them the ordinary least squares
(OLS) estimators. Why they are called ordinary will be clear when you learn about generalized
least squares (GLS) estimation in the econometrics course. The estimate of the error variance σ²
is then
s² = (1/(n − 2)) Σi ei².
The square root s is sometimes reported as the standard error of the regression. These estimators
are unbiased (if we want s² to be unbiased, the denominator must be n − 2). With a bit of algebra,
we get
var[b0] = σ² Σ xi² / (n Σ (xi − x̄)²),   var[b1] = σ² / Σ (xi − x̄)².

Estimates of the standard errors are obtained by replacing σ² by s². Inference (confidence limits
and testing) on the regression coefficients is based on these estimated standard errors. Mind that
the variances are conditional on the values of X that we have in our data set. This is evident
from the fact that these values are involved in the formulas. The variance of b1 can be reduced
by increasing the variation of X. This is used in experimental design to improve the precision of
the slope estimates.

4.6. Testing
If the error term is normally distributed, we can calculate 12 confidence limits for the regression
coefficients, putting b t (n 2) se[b]. Standard errors can also be used to run a t test. For the
null H0 : i = 0, we use
bi
t= ,
se[bi ]
with df = n2. In the computer, the standard regression output contains, besides every parameter
estimate, the standard error, the t statistic and the P -value (Table 4.1). The report also includes
the ANOVA decomposition of paragraph 4.2. It can also be seen that the F statistic
SSE (n 2)R2
F = n2 =
SSR 1 R2
is the square of the t statistic associated to 1 , so that it is redundant here (this is no longer true
in multiple regression). The second formula, involving R2 , shows that the significance of the slope
occurs when R2 is close enough to 1. We say that R2 is significant, or that the correlation is
significant, when the F statistic is so. It is also obvious, from the formula, that weak correlations
can be significant with samples big enough.
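These identities are easy to check from the regress stored results. A minimal sketch, once more on the simulated x and y of paragraph 4.1 (an assumption):
quietly regress y x
display e(F)
display (e(N) - 2)*e(r2)/(1 - e(r2))
display (_b[x]/_se[x])^2
The three lines print the same F statistic.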
A noteworthy detail about the coefficient estimators b0 and b1 is that they are correlated. This
means, in practice, that it is not correct to make inferences about both coefficients separately, as
we do with the mean and the variance (which are uncorrelated). We skip the formula here, because
the right setting for the discussion is that of the general regression model, where we find a formula
for obtaining the covariance matrix of the regression coefficient estimators at once. Nevertheless,
this issue is lightly touched on in the homework (exercise E).
The analysis of the residuals is useful for checking the validity of the model. This analysis is
similar to that of the residuals of the one-way ANOVA, plus the possibility of a residual plot,
where we place the residuals (standardized or not) on the vertical axis, and xi, the fitted values
ŷi = b0 + b1 xi, or the order in which the data were obtained on the horizontal axis.
What can be done without the normality assumption? Practically the same, for big samples.
Indeed, it can be shown that the OLS estimators of the regression coefficients are asymptotically
normally distributed, so that most of the previous discussion remains valid.

Figure 4.2. Distribution of residuals (Example 4.1)

Example 4.1 (continuation). In Stata, we obtain a standard regression report with the command
regress. This report contains a table with the coefficient estimates (Table 4.1), an ANOVA table
and some additional results. In the coefficients table, we find the coefficient estimates, the standard
errors, the t statistic and the P-value.
TABLE 4.1. Linear regression results (Example 4.1)

  SALARY     Coefficient   Std. error   t statistic   P-value
  MARKET     34,545.2      3,424.3      10.09         0.000
  Constant   18,097.0      3,288.0      5.50          0.000

Apart from the ANOVA table, we find F(1, 512) = 101.8. This is the square of the t statistic
for MARKET (t = 10.09), with the same P-value. R² = 0.166 appears below. The adjusted R²
statistic, used in multiple linear regression to compare models with different numbers of independent
variables, is irrelevant here.
The covariance matrix of the coefficient estimators is not reported, but Stata saves it as e(V). This
means that, naming this matrix, we can use it in any calculation. In this example, we get (the
order is the same as in Table 4.1)

  cov[b] = [  11,726,058
             −11,122,417   10,811,002 ].

From this, we can estimate the correlation, cor[b1, b0] = −0.988. Such a strong correlation will
probably surprise you, but you need more experience for dealing with this issue, so I leave it
for the econometrics course. The residuals can be obtained with the predict postestimation
command, as in ANOVA (section 3). The diagnostic plots of Figure 4.2 reveal a clear departure
from normality in the left tail. But, given the sample size, nonnormality is not a problem here.
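A sketch of these steps follows. The data set name is not given in these notes, so the use line is only illustrative:
use salaries.dta, clear
regress SALARY MARKET
matrix V = e(V)
display V[2,1]/sqrt(V[1,1]*V[2,2])
predict res, residuals
histogram res
qnorm res
The display line computes the correlation between the two coefficient estimators from e(V); the last three commands produce the residuals and the diagnostic plots of Figure 4.2.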

4.7. Homework

A. Because of the skewed distribution of variables such as salaries, sales or size, econometricians



introduce them in log scale in linear models. Rerun the analysis of Example 4.1 replacing
SALARY by log(SALARY) and compare the results with those obtained in paragraph 4.6.
How do you interpret the coefficients in both cases?
B. An ANOVA approach can also be used to test the influence of marketability on salary. To
do it, divide the sample of Example 4.1 into three groups that may be taken as low, medium
and high marketability and test the difference in mean (log) salaries as in section 3.
C. Split the sample of Example 4.1 into two groups that may be taken as low and high mar-
ketability and test the difference in mean (log) salaries with a t test. Then code the two
groups with a dummy (1 = high, 0 = low) and run a linear regression of SALARY on this dummy.
Compare the results of the two analyses.
D. Although the idea of the financial performance of a firm may seem obvious, there is no
consensus on how to measure it. Two well-known measures are the return on equity (ROE)
and the return on assets (ROA). The ROE measures a firm's efficiency at generating profits
from every unit of shareholders' equity. The ROA tells us how profitable the firm's assets
are in generating revenue, or, more specifically, how many dollars of earnings it derives from
each dollar of assets it controls. To explore the extent to which these two measures provide
equivalent information about firms, a data set, covering a wide range of industries, has
been prepared. It contains the ROE and the ROA of 426 firms for the year 2000, derived
from public sources. The ROE has been calculated as net income over total equity, and the
ROA as operating income over total assets. Perform a regression analysis. What is your
conclusion?
E. Generate 1,000 independent samples of size 10 of (X, Y), with X ∼ N(2, 1), Y = 3 + X + ε
and ε ∼ N(0, 0.04). Fit a regression line to each sample. Save the coefficients obtained
and examine the joint distribution of the slope and the intercept. You may use the Stata
program:
* Definition
program define h4e, rclass
syntax [, size(integer 1) ]
drop _all
set obs `size'
tempvar x y
generate `x' = 2 + rnormal()
generate `y' = 3 + `x' + 0.2*rnormal()
quietly regress `y' `x'
matrix b = e(b)
return scalar b1 = b[1,1]
return scalar b0 = b[1,2]
end
* Execution
simulate b1=r(b1) b0=r(b0), reps(1000): h4e, size(10)
Redo the exercise with X ∼ N(0, 1). Explain the different results obtained.

5. More on testing

5.1. The alternative hypothesis


Hypothesis testing was presented in section 3 as applied to the rejection of a null hypothesis given
by one or several equalities. Nevertheless, most textbooks relate it to a choice between the null



and an alternative hypothesis H1. This is consistent with section 3 if we take H1 to be the negation
of H0.
Take, for instance, the one-sample t test of Example 3.1. As the test was presented, H1: μ ≠ 0.
This is a bilateral or two-tail test, in which significant results would lead us to conclude that the
unknown parameter is not equal to the hypothesized value. But, if we have decided in advance
that we want to prove μ > 0, we specify this as the alternative hypothesis H1. Then, we set the
null to H0: μ ≤ 0. This is a one-tail test.
Some textbooks leave the null in this case as μ = 0, so that both hypotheses are no longer
complementary. This is less elegant, from the theoretical point of view, but quite frequent. Anyway,
for us this is just a formalism, since what matters is the reduction in the P-value (one half) that
we get switching from two- to one-tail testing.
Imagine that, in Example 2.2, we hypothesize that the mean satisfaction is higher in Mexico than
in Chile. We sample both populations and calculate the means. If the mean is higher in Chile than
in Mexico, we fail to prove our hypothesis. If it is higher in Mexico, we calculate the t statistic
as in paragraph 3.2, but the P-value is the area of one tail, that is, one half of that of the two-tail
test. Thus, we get more significance with one-tail tests, which makes them attractive to authors,
but suspicious to referees.
Halving the P-value makes the one-tail test controversial. So, it is rarely seen in management
science. Nevertheless, you may find examples in other fields, like pharmaceutical research, where
one may try to show that a new product is better (not just different) than the reference product.
So, for mean differences and correlations, which can have positive or negative sign, two-tail testing
is the rule. This is so in spite of the fact that most of the hypotheses declared in management
science research papers are frequently phrased in terms of either a positive or negative association
(meaning correlation) or effect (a regression coefficient).

5.2. Test design


This paragraph presents a brief description of how to design a test. Suppose that the null is related
to an unknown parameter θ. Then:
We choose a test statistic G = g(X1, . . . , Xn), such as a t or an F statistic.
We choose a critical value c and specify the rejection of H0 when G ≥ c.
For a given value of θ, the power of the test is the probability of rejection, p[G ≥ c]. How
the power changes as a function of θ is usually presented as a curve. A good test should
have low power when θ is in the range covered by H0 and high power when it is in that of
H1.
The type I error is the error of rejecting the null when it is valid. We denote by α the
maximum probability of a type I error, that is, the maximum power under H0. Example:
in the two-tail t test of paragraph 3.2, α = p[G ≥ c | H0].
The type II error is the error of not rejecting the null when it is false. The maximum
probability of a type II error is denoted by β.
A test is chosen in such a way that α and β are as small as possible. This issue has already been
addressed by the designers of the tests recommended in statistics textbooks and software, and
users profit from this. The commonest strategy is to fix α = 0.05 (95% confidence) and search for
a test with β as low as possible.

Example 5.1. We use simulation to calculate the power of a test of the null μ = 0 for a normal
population with σ = 1, based on the statistic
Z = x̄ / (S/√n).
We assume μ = 0.5 and perform the simulation for n = 5 and for n = 10. We draw first 10,000
samples of size 5 from a N(0.5, 1) distribution. The null is rejected in 1,922 cases, a proportion
of 19.22%. This is an approximation of the power. Is this so low because the test is deficient?
No, it is so because the sample is small, given that the difference is only one half of the standard
deviation. We try again, but doubling the sample size. Now the proportion of rejections increases
to 40.86%. Try to reproduce this experiment by yourself.
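A possible way to reproduce it is sketched below. This is not the script used for these notes: the program name, the seed, and the rejection rule |Z| > 1.96 are my own choices, so the proportions obtained need not match the figures above exactly.
* Power simulation for Example 5.1
program define power1, rclass
syntax [, size(integer 5) ]
drop _all
set obs `size'
generate x = 0.5 + rnormal()
quietly summarize x
return scalar z = r(mean)/(r(sd)/sqrt(r(N)))
end
set seed 12345
simulate z=r(z), reps(10000): power1, size(5)
count if abs(z) > 1.96
The count, divided by 10,000, approximates the power; rerunning the last two commands with size(10) gives the second figure.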

5.3. Chi square testing


The Pearson χ² test is a classic that can be applied in a variety of situations. Briefly speaking, the
χ² statistic is a measure of the agreement between the data and a theoretical model. It is obtained
as a sum of terms related to squared differences between some observed values (i.e., derived from
the data) and the corresponding expected values, as given by the model. Where the calculation of
the expected values involves an unknown parameter, we use an estimate. Assuming that the model
is valid, the χ² statistic has a χ² distribution, whose number of degrees of freedom is related to
the number of terms in the sum that defines the χ² statistic. More specifically, it is equal to the
number of terms, minus one, minus the number of parameter estimates used to calculate the
expected values.
The formula of the Pearson χ² statistic has a simple structure, involving the sum of as many
summands as cases being considered. Each of the summands has the structure (O − E)²/E, where
O = OBSERVED and E = EXPECTED. This statistic can be used for testing distributional
assumptions, typically normality. Nevertheless, it is too conservative, so, even if many elementary
textbooks describe it, it is rarely used today in practical research. We will see better alternatives
in paragraph 5.5.

5.4. The chi square test of independence


A variant of the Pearson χ² test that still survives applies to data in a contingency table format.
These tables result from the cross-tabulation of two categorical variables. Table 5.1 illustrates the
structure of a contingency table, with row categories A1, . . . , Ar and column categories B1, . . . ,
Bc. To simplify the notation, we assume that the number of rows is less than or equal to the
number of columns (r ≤ c). The data can come as frequencies, or as proportions.
TABLE 5.1. Contingency table (frequencies)

           B1     ...   Bj     ...   Bc     Total
  A1       n11    ...   n1j    ...   n1c    n1+
  ...      ...          ...          ...    ...
  Ai       ni1    ...   nij    ...   nic    ni+
  ...      ...          ...          ...    ...
  Ar       nr1    ...   nrj    ...   nrc    nr+
  Total    n+1    ...   n+j    ...   n+c    n

In Table 5.1, nij refers to the joint occurrence Ai ∩ Bj, the row total ni+ to the occurrence of
Ai, and the column total n+j to that of Bj. The proportions are
pij = nij/n,   pi+ = ni+/n,   p+j = n+j/n.

The null hypothesis of an independence test is that the two categorical variables, rows and columns,
are statistically independent. This means that the product formula πij = πi+ π+j holds for every
cell. In the test, we take pij as the observed proportion and p̂ij = pi+ p+j as the expected
proportion. The test statistic is
X² = n Σi,j (pij − p̂ij)² / p̂ij.

The degrees of freedom are here (r − 1)(c − 1). The differences pij − p̂ij, adjusted to the size of
the marginal proportions (this is similar to standardizing the covariance by transforming it into a
correlation),
aij = (pij − p̂ij) / √p̂ij,
are called residuals. So, the χ² statistic is the sum of the squared residuals multiplied by the
sample size n. The absolute values of the residuals can be sorted, to identify the cells that are
more influential in the test or, equivalently, those showing the biggest disagreement with the
model. The sign tells us if the observed frequency is above or below the expected value (assuming
independence). These residuals are also used in correspondence analysis (multivariate statistics
course).
There is an alternative χ² test for contingency tables, called the χ² likelihood ratio test. The test
statistic is, now,
G² = 2n Σi,j pij log(pij / p̂ij),
and the P-value is given by the same χ² distribution. This test is based on maximum likelihood
estimation, not covered in this course. The results are very similar to those of the Pearson test.

Example 5.2. Table 5.2 shows fictional data on the purchases of products A, B and C. The sample
is partitioned by age: young adults (18–35), middle aged (36–55) and senior (56 and older).
TABLE 5.2. Contingency table (frequencies)

  Group          A    B    C     Total
  Young adults   20   20   20    60
  Middle age     40   10   40    90
  Senior         20   10   40    70
  Total          80   40   100   220

Table 5.3 is the corresponding table of proportions, obtained by dividing each frequency by the
total sample size. Expected proportions have been included in the table, in parentheses.
TABLE 5.3. Contingency table (proportions)

  Group          A               B               C               Total
  Young adults   0.091 (0.099)   0.091 (0.050)   0.091 (0.124)   0.273
  Middle age     0.182 (0.149)   0.045 (0.074)   0.182 (0.186)   0.409
  Senior         0.091 (0.116)   0.045 (0.058)   0.182 (0.145)   0.318
  Total          0.364           0.182           0.455           1

Table 5.4 contains the residuals for the proportions of Table 5.3. The χ² statistic is
X² = 220 [(−0.026)² + 0.186² + ···] = 17.64   (df = 4, P = 0.001).
The likelihood ratio test statistic is
G² = 440 [0.091 log(0.091/0.099) + 0.091 log(0.091/0.050) + ···] = 16.60   (P = 0.002).



TABLE 5.4. Residuals (proportions)

  Group          A        B        C
  Young adults   −0.026   0.186    −0.094
  Middle age     0.086    −0.106   −0.010
  Senior         −0.073   −0.051   0.098
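These results are easy to reproduce in Stata from the frequencies of Table 5.2, feeding the table directly to the tabi command; the chi2 and lrchi2 options give the Pearson and the likelihood ratio statistics, respectively:
tabi 20 20 20 \ 40 10 40 \ 20 10 40, chi2 lrchi2
With a data set of individual observations, the equivalent command would be tabulate with the same two options.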

The same test can also be used to test the homogeneity of the probability distribution across
populations, that is, a null H0: π1j = ··· = πrj for j = 1, . . . , c, where the rows are the populations.
This is called a homogeneity test. The data can be presented as in Table 5.2, the only difference
being that the row totals n1+, . . . , nr+, which will correspond in this case to the sizes of the
samples drawn from the populations compared, will be specified in the data collection design.

5.5. Normality testing


In this paragraph, I present a short survey of normality testing. The first consideration to be made
is that there are no normality tests, properly speaking, that is, there is no test whose null hypothesis
is the whole normal distribution. What we really test is a certain trait of the normal distribution,
which, in the case of the χ² test, consists of the probabilities of a set of prespecified intervals.
The second consideration is that, depending on the particular type of departure from the normal
they are concerned with, researchers in different fields favor this or that normality test.
The Pearson χ² test, as a normality test, is too conservative, failing to reject the null in many
cases of gross departure from the normal. The Kolmogorov-Smirnov (KS) test, equally simple and
more powerful, has been for years the most popular normality test. But, since it is not (directly)
provided by Stata, it is not covered here. Nevertheless, Stata provides the popular Shapiro-Wilk
test, based on the correlation of the normal probability plot (command swilk). I do not give
details here, but this command is very easy to use.
The sampling distributions of the sample skewness and kurtosis are asymptotically normal and
independent, with standard errors √(6/n) and √(24/n), respectively. The P-values can be obtained
by taking the ratios of these statistics to their respective standard errors as values of a z test
statistic. The Jarque-Bera (JB) test, quite popular in econometrics, is a normality test based on
the statistic
JB = (n/6) (Sk² + K²/4).
Under normality, JB is asymptotically χ²(2) distributed. Instead of the Jarque-Bera test, Stata
provides an alternative test, also based on the skewness and the kurtosis (command sktest).
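For a variable x (an illustrative name, not one of the course data sets), these quantities are easy to obtain from the results stored by summarize; the sketch below computes the two z statistics and the JB statistic by hand, and then calls the two Stata tests:
quietly summarize x, detail
scalar s_sk = r(skewness)
scalar s_ku = r(kurtosis) - 3
scalar s_n = r(N)
display "z(skewness) = " s_sk/sqrt(6/s_n)
display "z(kurtosis) = " s_ku/sqrt(24/s_n)
display "JB = " (s_n/6)*(s_sk^2 + s_ku^2/4)
display "P = " chi2tail(2, (s_n/6)*(s_sk^2 + s_ku^2/4))
sktest x
swilk x
Note that r(kurtosis) in Stata is the raw kurtosis, so 3 is subtracted to obtain the excess kurtosis used in the JB formula.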

Example 1.2 (continuation). The sample skewness and kurtosis of the Brazil returns are, respec-
tively,
Sk = 0.193, K = 0.842,

with standard errors


se[Sk] = 0.152, se[K] = 0.303.

We thus obtain a nonsignificant z value for the skewness, z = 1.273 (P = 0.203), but the
kurtosis is highly significant, z = 2.776 (P = 0.005). Therefore, we reject the normal distribution,
due to the significant positive kurtosis found in the data. The Jarque-Bera statistic, JB = 9.330
(P = 0.009), leads us to the same conclusion. Less sharp results are given by the Kolmogorov-
Smirnov (D = 0.039, P = 0.824) and Shapiro-Wilk (W = 0.9901, P = 0.0715) tests.



5.6. Sign tests
The tests of section 3 are valid under normality assumptions, and asymptotically valid without
them. In general, a nonparametric test is one in which it is not assumed that the distribution of the
variables involved is of a particular type. Our first example of a nonparametric test, the sign test,
based on the binomial distribution, is an alternative to the one-sample t test (and consequently to
the paired data t test), with no distributional assumption.
Suppose a continuous distribution with median θ and a sample of size n from this distribution,
containing no zeros. We wish to test the null H0: θ = 0 (for θ = θ0, it suffices to subtract θ0 and
perform the test as presented here). Calling B+ the number of positive observations and B− the
number of negative observations, the test statistic is B = max(B+, B−). The test is based on the
fact that, under the null, the probability of a positive result is 0.5. Then, under the null, B+ and
B− have a B(n, 0.5) distribution. The P-value of the two-tail test (H1: θ ≠ 0) is the double of
the probability of the right tail associated with the actual value of B in the B(n, 0.5) distribution.
Some stat packages report asymptotic P-values, derived from a normal approximation.

Zero observations are discarded in this test. This is not a problem as long as the continuity
assumption, under which exact zeros are not expected, is tenable.

Example 3.1 (continuation). In Example 3.1, we find 443 cases in which the wages have been
increased. The (two-tail) P-value can be obtained in Stata with the binomialtail function,
which gives P < 0.001. This is consistent with the outcome of the t test. The asymptotic P-value
would be based on a normal approximation to the B(n, 0.5) distribution, whose mean here is
n/2 = 272.5 (hence n = 545 nonzero observations). This test can be directly performed in Stata
with the command signtest.
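A sketch of the exact calculation follows; the variable name wagedif is illustrative, and the counts (443 positive cases out of 545 nonzero observations) come from the example:
display 2*binomialtail(545, 443, 0.5)
signtest wagedif = 0
The first line gives the exact two-tail P-value from the binomial distribution; the second one runs the whole test on the data.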

5.7. The Wilcoxon signed rank test


The Wilcoxon signed rank test is a second alternative to the one-sample t test. The distributional
assumptions are continuity and symmetry with respect to the mean. We test, as in section 5.6,
the null μ = 0. The test statistic is obtained as follows:
We sort the observations by absolute value. Let us assume first that there are no ties.
We assign ranks. The first observation gets 1, the second 2, etc.
Calling T+ and T− the sums of ranks of the positive and negative observations, respectively,
we have T+ + T− = n(n + 1)/2. In the version provided by Stata, the test statistic is T+,
but many textbooks use T = max(T+, T−) to simplify the use of the tables.
Under the null, T+ has a symmetric (discrete) distribution, with mean and variance given by
E[T+] = n(n + 1)/4,   var[T+] = n(n + 1)(2n + 1)/24.

If there are ties in the absolute values, they get an average rank. The variance must be then
corrected. The Stata command signrank provides a correction for this case. To get exact significance
levels for this test and those which follow, one should look at the corresponding tables or use a
special package. What we usually find in a generalist stat package is an asymptotic significance
level. The difference may be relevant for small sample studies (eg in biostatistics), but not for the
sample sizes that we usually find in econometrics. In fact, the tables that we find in textbooks do
not go beyond n = 20. Asymptotic P-values are based on a normal approximation whose mean
and variance are given by the above formulas. Stata reports the z value associated with T+ (i.e.,
subtracting the mean and dividing by the standard deviation given by the above formulas).

Example 3.1 (continuation). With the Stata command signrank, we get z = 15.9 (P < 0.001).



5.8. The rank-sum test
The Wilcoxon two-sample rank-sum test is an alternative to the two-sample t test that only re-
quires that the distributions compared are continuous and of the same type. So, it applies to
two independent samples, of sizes n1 and n2, respectively. To simplify the notation, we assume
n1 ≤ n2. There are two equivalent versions, that of Wilcoxon, presented here, and that of Mann
and Whitney, sometimes called the Mann-Whitney U test. The test performs a comparison of two
PDFs related by an equation
f2(x) = f1(x − Δ).
Δ = θ1 − θ2 is sometimes called the treatment effect. The null is Δ = 0. The test statistic W is
obtained as follows:
The two samples are merged, and the resulting sample (size n1 + n2) is sorted.
We assign ranks to the observations, averaging ties.
W is the sum of the ranks of the first sample.
Under the null, W has a symmetric (discrete) distribution, with
μ = n1 (n1 + n2 + 1)/2,   σ² = n1 n2 (n1 + n2 + 1)/12.

As in the signed rank test, exact significance levels are usually extracted from tables, but only for
small samples. For n2 > 10, asymptotic levels are accepted.

Example 2.2 (continuation). To compare Chile and Mexico as in section 3, we use the Stata
command ranksum, getting z = 2.224 (P = 0.026), similar to the t test.

5.9. The Kruskal-Wallis test


The Kruskal-Wallis test is an extension of the rank-sum test to k independent samples, just as the
one-way ANOVA F test is an extension of the two-sample t test. The assumptions are as in the
rank-sum test. The test statistic is obtained as follows:
The samples are merged and the resulting sample is sorted, assigning ranks.
The test statistic is
H = [12 / (n(n + 1))] Σi Ri²/ni − 3(n + 1),
where Ri is the sum of the ranks of sample i and n is the total sample size (n = n1 + ··· + nk).
For ni ≥ 5, the distribution of H can be approximated by a χ²(k − 1).

Example 2.2 (continuation). To compare the three countries, we use the command kwallis. This
gives 2 (2) = 9.46 (P = 0.009), more significant than in Table 3.3.

5.10. Homework
A. The results of Table 5.5, where companies have been classified according to their activity in
two sectors, Production and Services, come from a study on work-family conciliation. The
columns of the table correspond to the responses to the question:
Are all the managers in your company concerned with work-family balance?
Test the effect of the type of activity on the concern with work-family balance using a chi
square test.



Source: I Alegre, N Chinchilla, C León & MA Canela (2007), Políticas de conciliación,
liderazgo y cultura en 2200 PYMES españolas, ICWF, IESE Business School, Estudio
50.

TABLE 5.5. Work-family balance (Exercise A)

               None   Some   Majority   All   Total
  Production   218    725    790        260   1993
  Services     179    707    1070       305   2261
  Total        397    1432   1860       565   4254

B. Apply a normality test to the residuals of Example 4.1.


C. Test the normality of the Mexico group in Example 2.2.
D. Rerun the analysis of exercise C of section 3, using a nonparametric test.
E. In this exercise, you simulate the sampling distribution of the sample skewness and kurtosis
with sample size n = 100. To do it, generate a respectable number of normal samples of
this size, and calculate the skewness and the kurtosis of each sample, saving them as two
variables in a separate data set. You may use the Stata program:
* Definition
program define h5e, rclass
syntax [, size(integer 1) ]
drop _all
set obs `size'
tempvar z
generate `z' = rnormal()
quietly summarize `z', detail
return scalar sk = r(skewness)
return scalar ku = r(kurtosis) - 3
end
* Execution
simulate sk = r(sk) ku = r(ku), reps(5000): h5e, size(100)

Examine the distributions of the sample skewness and kurtosis. Did you get the expected
standard errors (√(6/n) and √(24/n), respectively)?
