ST Inf
Miguel-Angel Canela
Contents
1. Sampling
2. Parameter estimation
3. Testing means and variances
4. Simple linear regression
5. More on testing
1. Sampling
1.1. Foreword
In this second part of the course, we turn to data analysis and, more specifically, statistical inference. It loosely follows the DeGroot & Schervish (2002) textbook, changing the order of appearance only in special cases, as for simulation, but reducing the scope of the theoretical discussion here and there, especially in estimation and testing. It differs from other introductory courses in that multiple regression is not covered and analysis of variance is restricted to one-way ANOVA. Full treatment of these topics is left for the econometrics course.
Computation is expected to be done in Stata. We indicate Stata code with a special typeface, as in regress. Since the students of this course are expected to perform their own statistical analyses, they may consider using texts linked to Stata that contain moderate doses of each topic covered. I recommend Hamilton (2009), a popular lightweight.
These notes are complemented with data sets and scripts. The scripts contain the commands
used in the examples. The data sets are in Stata format (extension .dta). They can be opened
by double-clicking or from Stata with the use command. All the material for the course can be
downloaded from http://blog.iese.edu/mcanela/mrm.
Figure 1.1. Histogram of the strike durations, in days (Example 1.1)
A histogram is a (vertical) bar diagram in which the bars are based on intervals of values of the variable whose distribution is examined. The height of each bar is proportional to the frequency of its interval. The scale of the vertical axis can be set in terms of frequencies (counts) or proportions. In the Stata command histogram, proportions are the default option. The upper sides of the rectangles of a histogram can be seen as an approximation to the density curve. The histogram can thus be compared to the density of the candidate model. This is the theory but, in practice, what we see in a histogram depends on the choice of the intervals, especially in small samples. I would recommend beginners to start with no more than 5–8 intervals whose extremes are round numbers.
Example 1.1. The strike duration data given by Kennan (1985) are frequently used to illustrate duration data modelling in econometrics courses. They give the duration, in days, of 62 strikes that commenced in June, from 1968 through 1976, each involving at least 1,000 workers and beginning at the expiration or reopening of a contract. The histogram (Figure 1.1) looks like an exponential distribution.
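As an illustration of the binning advice above (in Python rather than Stata, since these notes assume only Stata; the durations below are made-up values, not Kennan's data), here is a minimal sketch of a histogram with round-number bin extremes:

```python
import numpy as np

# A small sample of strike durations in days (hypothetical values, not Kennan's data)
durations = np.array([1, 2, 3, 7, 9, 12, 14, 21, 28, 32, 41, 49,
                      60, 72, 80, 95, 120, 140, 180, 215])

# Five intervals with round-number extremes, as recommended above
bins = [0, 50, 100, 150, 200, 250]
counts, _ = np.histogram(durations, bins=bins)

# Proportions, the default vertical scale of Stata's histogram command
proportions = counts / counts.sum()
```

Plotting `proportions` as bars over the intervals reproduces, up to the vertical scale, what Stata's histogram command draws by default.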
Quantile-quantile (QQ) plots are scatterplots, in which the two axes correspond to quantiles of a
distribution. This presentation is restricted to a special QQ plot for the normal distribution, the
normal probability plot (command qnorm in Stata). It matches an empirical distribution, not to
an individual distribution, but to the whole normal distribution model.
The normal probability plot is based on the fact that there is a linear relationship between a normal variable and the N(0, 1) distribution. Suppose that we have a sample of independent univariate observations x₁, x₂, ..., xₙ. We put the order statistic x₍ᵢ₎ on one axis and the N(0, 1) quantile zᵢ = Φ⁻¹(i/(n + 1)) on the other axis. Then, if the data were extracted from a normal distribution, the n points in the normal probability plot would be close to a straight line.
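The plot coordinates can be computed directly; a minimal Python sketch (the course uses Stata's qnorm, but the construction is the same):

```python
from statistics import NormalDist

def normal_qq_coords(sample):
    """Coordinates for a normal probability plot: (N(0,1) quantile, order statistic)."""
    x = sorted(sample)  # order statistics x_(1) <= ... <= x_(n)
    n = len(x)
    z = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(z, x))

coords = normal_qq_coords([2.1, -0.3, 0.4, 1.0, -1.2])
```

If the sample is close to normal, the points `(z_i, x_(i))` fall near a straight line whose slope and intercept estimate the standard deviation and the mean.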
Example 1.2. The data set for this example contains the daily returns of the Brazil and Mexico MSCI indexes. It has been extracted from the DataStream database and covers the whole year 2003, with a total of 261 observations (no data on weekends). Returns are derived from the index values as follows. If xₜ is the value of a particular index at day t, the daily return at this day is given by rₜ = xₜ/xₜ₋₁ − 1. The returns used here come in percentage scale.
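For instance, a sketch of the returns computation in Python (the index values below are made up for illustration, not the MSCI data):

```python
# Daily returns from index values, r_t = x_t / x_{t-1} - 1, in percentage scale.
# Hypothetical index values, not the actual MSCI series.
values = [100.0, 102.0, 99.96]
returns = [100 * (values[t] / values[t - 1] - 1) for t in range(1, len(values))]
```

Note that n index values yield n − 1 returns, which is why a full trading year gives 261 observations.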
Figure 1.2. Histogram (left) and normal probability plot (right) of the Brazil returns (Example 1.2)
We use only the Brazil data. The estimates for the mean and the standard deviation (see paragraphs 1.3 and 1.4) are x̄ = 0.273 and s = 1.456. You can see a histogram of the Brazil returns in the left panel of Figure 1.2. The distribution is not really skewed, but the tails seem to be fatter than those of the normal distribution. The right panel of Figure 1.2 is a normal probability plot, including a straight line. The line has been chosen so that it passes through the first and third quartiles (others fit a regression line). You may find in this graphic the traits already identified in the histogram. These traits could be predicted from the sample skewness and kurtosis (paragraph 1.4).
This is one example of what in finance is called fat tails, a special pattern of departure from
the normal distribution. Since the normality of the returns was taken for granted in the classical
portfolio theory, the persistent evidence of fat tails found in financial returns data has been
discussed many times. Nowadays, the normality assumption has already been dropped.
Figure 1.3. Scatterplots of simulated bivariate normal samples, with ρ = 0.75 and ρ = 0.25
for which we do not have a simple formula. This is seen through practice.
The only thing that computers really simulate is the uniform distribution in the unit interval.
For instance, 0.5841526, 0.2326198, 0.6901792, 0.8181496 and 0.0532115 are five random numbers,
generated with the function runiform() of Stata (called uniform() before Stata 10.1). The rest
of the distributions simulated are obtained from this by means of diverse transformations, that can
be invented by the user or be available in your software of choice.
Since they are not really uniformly distributed, but only approximately so, some call pseudorandom numbers what we here call random numbers. The distinction is irrelevant at the level of this course.
Stata has commands for generating samples from many special distributions (see the manual). The standard normal can be simulated with the Stata function rnormal(), whose syntax is similar to that of runiform(). With rnormal(m,s), we can specify the mean and the standard deviation (not the variance). The drawnorm command (this is a command, not a function) can be used for sampling from a multivariate normal distribution. There are various ways to specify the distribution, through the options of drawnorm. We have used drawnorm to produce the simulated samples of Figure 1.3, which correspond to bivariate normal distributions with standard marginals, with ρ = 0.75 and ρ = 0.25, respectively.
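Outside Stata, the same kind of simulation can be sketched with NumPy (a hypothetical illustration of what drawnorm does, not its implementation): independent standard normal draws are correlated through a Cholesky factor of the target covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: bivariate normal with standard marginals and correlation rho
rho = 0.75
cov = np.array([[1.0, rho],
                [rho, 1.0]])

# Correlate independent N(0,1) draws through a Cholesky factor of cov
z = rng.standard_normal((1000, 2))
L = np.linalg.cholesky(cov)
sample = z @ L.T  # each row is one draw from the bivariate normal

r = np.corrcoef(sample.T)[0, 1]  # sample correlation, close to 0.75
```

A scatterplot of `sample` reproduces, qualitatively, the left panel of Figure 1.3.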
This means that, on average, the sample mean is right. Since the variance is a measure of the dispersion of an estimator, we look next at the variance of the sample mean which, by the independence of the observations, is var[X̄] = σ²/n. Note that the independence has been used here, but not in the preceding argument. From the expression of the variance, we see that it tends to zero as n → ∞. This means that the variation of X̄ becomes irrelevant for big samples. We say that X̄ converges to μ as n → ∞. This statement, called the law of large numbers, is one of the great theorems of mathematical statistics.
Although the idea of the law of large numbers should be clear enough to you, a comment on limit
theorems is worthwhile here. Much effort, when writing statistics textbooks, is put on distinctions
among the different types of convergence and on the proofs of limit theorems. Why is this? The
definition of the limit of a sequence of numbers has nothing to hide, because numbers are simple
things, but a random variable carries a lot on its back. What converges, the values of the variables,
the densities, parameters like means and variances . . . ? The fact is that there are different types of
convergence, and developing them here will take more space than allowed. So, the discussion here
is quite short. The law of large numbers is easily proved if we formulate it in terms of convergence in probability. It can be stated as follows: for every number ε > 0,

lim p(|X̄ − μ| > ε) = 0   as n → ∞.

This is expressed, in short, as plim X̄ = μ. The proof is based on the Chebyshev inequality,

p(|X − μ| > ε) ≤ σ²/ε²,

which is valid for any probability distribution with moments of first and second order, and is not hard to prove. Applied to X̄, whose variance is σ²/n, the bound becomes σ²/(nε²), which tends to zero. We will come back to limit theorems in the next chapter.
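A quick numerical illustration of the law of large numbers (a Python sketch with a fixed seed; a demonstration, not a proof): the spread of the sample mean shrinks like σ/√n.

```python
import numpy as np

rng = np.random.default_rng(7)

# Standard deviation of the sample mean of n uniform(0,1) observations,
# estimated from 2000 replicates; theory gives sigma/sqrt(n) = sqrt(1/(12 n))
spreads = {}
for n in (10, 100, 1000):
    means = rng.uniform(size=(2000, n)).mean(axis=1)
    spreads[n] = means.std()
```

Multiplying n by 100 divides the spread by about 10, as the σ/√n rate predicts.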
Except for the detail that the denominator is n 1 instead of n (this will be justified in section
2), these two measures are particular cases of covariances, and therefore inherit the properties of
the covariance. Some properties of these formulas can also be derived from the analogy between
the covariance and the product of two vectors. For instance, in the same way that we normalize
a vector by dividing it by the modulus, we can do it with zero-mean variables, dividing by the
standard deviation. This is used to standardize the columns of a data set, producing z-scores.
Another example of this analogy: since the product of unit vectors is equal to the cosine of their angle, the covariance of two unit-variance columns is a cosine, i.e. a number between −1 and 1. This statistical cosine gives the formula of the correlation that we use in data analysis. For more than two variables, we can calculate covariance and correlation matrices.
Replacing the numbers xᵢ, yᵢ by statistical samples Xᵢ, Yᵢ, we get the definition of the sample covariance,

S_XY = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ),
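The covariance-as-cosine analogy is easy to verify numerically; a small Python sketch with illustrative data:

```python
import numpy as np

# Illustrative data, two columns
x = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
y = np.array([2.0, 1.0, 5.0, 8.0, 8.0])

# Sample covariance with the n - 1 denominator
sxy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Correlation as the cosine of the angle between the centered columns
xc, yc = x - x.mean(), y - y.mean()
r = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
```

The cosine `r` coincides with the usual correlation coefficient, and therefore lies between −1 and 1.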
Figure 1.4. Distribution of the mean of two (left) and five (right) throws of a die
Higher-order sample moments, such as the skewness and kurtosis, are defined in a similar way. All these statistics are random variables, which we will use to estimate the parameters of the distribution sampled (section 2).
Example 1.2 (continuation). In Example 1.2, the sample mean vector and covariance matrix are

x̄ = ( 0.273 )      S = ( 2.120  0.543 )
    ( 0.105 ) ,         ( 0.543  0.934 ) .
The normality of the Brazil returns was explored graphically in paragraph 1.1. The sample skewness and kurtosis are Sk = 0.193 and K = 0.842, respectively, the latter suggesting that the normality assumption should be discarded.
X̄ₙ = (X₁ + ⋯ + Xₙ)/n.

According to the central limit theorem, the CDF of the standardized mean √n(X̄ₙ − μ)/σ converges to the standard normal CDF as n → ∞. This means, in practice, that we can use the normal as an approximation to calculate probabilities related to X̄ₙ. An equivalent version of the theorem uses the sum instead of the mean, replacing μ and σ by nμ and σ√n, respectively. Comments about a particular distribution being asymptotically normal usually refer to the approximation of this distribution by a normal. When such approximations are available, it is customary to specify whether the true distribution or the approximation is used. The terms exact and asymptotic are commonly used for this purpose.
The convergence granted by this theorem is not the same as that of the law of large numbers.
Here, we do not say that the variables converge to a limit, but that the distributions do.
One of the great things of the central limit theorem is that there is no restriction on the type of
distribution, as long as it has moments of first and second order. In particular, it can be used for
the continuous approximation to a discrete distribution. To see how this works, look at the plots of
Figure 1.4. For the outcome of a regular die, the distribution is uniform, with probability 1/6. But the distribution of the mean of two dice (left) is no longer uniform, but triangular. What happens when we increase the number of dice averaged? The number of possible values increases, so that a continuous approximation makes sense, and the PDF gets closer to the bell shape of the standard normal (right).
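The dice illustration is easy to reproduce; a Python sketch computing the exact distribution of the sum of k dice by repeated convolution (the mean is just the sum rescaled, so the shape is the same):

```python
import numpy as np

die = np.full(6, 1 / 6)  # probabilities of faces 1..6

def sum_of_dice(k):
    """Exact probability vector for the sum of k independent dice (index i <-> sum k + i)."""
    p = die
    for _ in range(k - 1):
        p = np.convolve(p, die)
    return p

p2 = sum_of_dice(2)  # triangular, mode at sum 7
p5 = sum_of_dice(5)  # already close to a bell shape
```

Plotting `p2` and `p5` as bar diagrams reproduces the two panels of Figure 1.4.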
By virtue of the central limit theorem, the normal can be seen as the limit of many distributions. For instance, a B(n, π) variable is the sum of n independent Bernoulli variables. Then,

B(n, π) ≈ N(nπ, nπ(1 − π)).

Also, a Poisson variable with λ = n can be seen as the sum of n independent Poisson variables with λ = 1. Then,

P(n) ≈ N(n, n).

Figure 1.5 illustrates these approximations. On the left, we see the B(100, 0.1) probabilities and the N(10, 9) approximation. On the right, the P(50) probabilities and the N(50, 50) approximation. It is worth remarking, with respect to the approximation of the binomial, that the quality of the approximation depends not only on n, but also on π, improving as π gets closer to 0.5.
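A quick numerical check of the binomial approximation, in Python with the standard library only:

```python
import math

def binom_pmf(n, p, k):
    """Exact B(n, p) probability of k successes."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mean, var):
    """N(mean, var) density."""
    return math.exp(-(x - mean)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# B(100, 0.1) against its N(10, 9) approximation, at the mean
exact = binom_pmf(100, 0.1, 10)
approx = normal_pdf(10, 10, 9)
```

The two values agree to about two decimal places, in line with the left panel of Figure 1.5.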
Finally, the central limit theorem also applies to the χ² distributions, so that, for a high n, the χ²(n) distribution is asymptotically normal, approximately N(n, 2n).
1.7. Homework
A. Explore the normality of the Mexico returns in Example 1.2.
B. Simulating a multivariate uniform distribution is easy. For instance, from two samples (of the same length) of pseudorandom numbers we simulate the uniform distribution on the unit square [0, 1] × [0, 1]. After discarding the points where x + y > 1, we get a sample of the uniform distribution of Example 3.13. Use this approach to generate a sample of size 200 and plot it to convince yourself.
Figure 1.5. Normal approximation for the B(100, 0.1) (left) and P(50) (right)
C. Draw a random sample of size 1000 from a trivariate normal distribution, with

μ = ( 1 )      Σ = ( 1    0.2  0.8 )
    ( 2 )          ( 0.2  1    0.6 )
    ( 1 ) ,        ( 0.8  0.6  1   ) .
D. Draw ten random samples of length 100 of an exponential distribution and average them.
Compare a histogram of one of the samples with a histogram of the mean.
E. Draw a random sample of size 250 of a P(25) distribution. Plot a histogram and a normal
probability plot. How normal does this look?
2. Parameter estimation
Figure 2.1. Histograms of the sample mean (left) and the sample median (right) in the simulation of Example 2.1
The sample mean is an example of a moment estimator. More generally, we can replace, in the formula of a moment, the expectation operator E by an average of the corresponding powers of the observations. For instance, the statistic

σ̂² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)²
is the moment estimator of the variance. Unfortunately, the expectation of this statistic does not coincide with σ². Indeed, since all the terms of the sum on the right have the same expectation (we assume here μ = 0, to shorten the equations),

E[σ̂²] = E[(X₁ − X̄)²] = E[X₁²] + E[X̄²] − 2 E[X₁X̄]
      = σ² + σ²/n − (2/n) E[X₁(X₁ + ⋯ + Xₙ)] = σ² + σ²/n − 2σ²/n = (n − 1)σ²/n.
Therefore, the corrected statistic

S² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

is preferred as an estimator of σ². S² is called (in most places) the sample variance. The formula for var[S²] is more complex, involving the kurtosis. Nevertheless, under normality, the distribution of the sample variance can be related to the χ² model (the proof involves some matrix algebra). More specifically,

(n − 1)S²/σ² ~ χ²(n − 1).
In particular,

var[S²] = 2σ⁴/(n − 1).
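The n versus n − 1 issue is easy to see numerically; a Python sketch (fixed seed, normal data) comparing the average of the two estimators over many samples:

```python
import numpy as np

rng = np.random.default_rng(123)
n, reps = 10, 20000

samples = rng.standard_normal((reps, n))  # true variance is 1
biased = samples.var(axis=1, ddof=0)      # divisor n: the moment estimator
unbiased = samples.var(axis=1, ddof=1)    # divisor n - 1: the sample variance

mean_biased = biased.mean()      # close to (n - 1)/n = 0.9
mean_unbiased = unbiased.mean()  # close to 1
```

On average, the moment estimator falls short of σ² by the factor (n − 1)/n, exactly as the derivation above predicts.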
The examination of the joint distribution of the mean and sample variance shows that, again under
a normality assumption, these estimators are independent. Assuming = 0 to make it shorter, we
Figure 2.2. Histogram of the sample variance (left) and scatterplot of the sample mean and variance (right) in the simulation of Example 2.1
have

E[(Xᵢ − X̄)X̄] = E[XᵢX̄] − E[X̄²] = E[Xᵢ²]/n − σ²/n = 0,

meaning that X̄ and Xᵢ − X̄ are uncorrelated. Under normality, this implies that they are independent, and so are X̄ and S².
Unfortunately, there is no complete agreement on the denominator in the formula of the variance
estimator, and some authors use n in the denominator.
Example 2.1. Sampling distributions are easily understood through simulation. In this example,
n = 10 and the distribution sampled is N (0, 1). We generate 1000 samples, saving the means,
medians and variances in a data set that contains 1,000 observations of each of these three statistics.
The histograms of the sample mean and the sample median (the sample median of 10 observations
is the midpoint between the fifth and the sixth observations) are shown in Figure 2.1. The means are 0.0039 and 0.0012, respectively, close to the zero population mean. The standard deviation of the sample mean is 0.3112, close to the theoretical value 10^(−1/2) = 0.3162. The standard deviation of the sample median is 0.3652, a bit higher. This agrees with the theory (paragraph 2.3), and supports the preference for the mean.
The left panel of Figure 2.2 is a histogram of the sample variance, consistent with the expected χ² profile. The mean is 1.0021 and the standard deviation 0.4732. Note that, according to the theory, the sample variance is one ninth of an observation of a χ²(9). The correlation between the sample mean and variance is 0.0167, also in agreement with the theory. This is illustrated in the right panel of the figure.
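Example 2.1 can be replicated outside Stata; a Python sketch with a fixed seed (the numbers obtained will differ slightly from those reported in the text):

```python
import numpy as np

rng = np.random.default_rng(2024)
reps, n = 1000, 10

samples = rng.standard_normal((reps, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)
variances = samples.var(axis=1, ddof=1)

sd_mean = means.std(ddof=1)      # theory: 10**-0.5 = 0.3162
sd_median = medians.std(ddof=1)  # a bit higher than sd_mean
```

Histograms of `means`, `medians` and `variances` reproduce Figures 2.1 and 2.2.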
The standard deviation of an estimator is usually called the standard error. We denote it by se[θ̂]. Among unbiased estimators, the one with the lower standard error is preferred. This is called efficiency. More explicitly, if θ̂₁ and θ̂₂ are unbiased estimators of θ, we say that θ̂₁ is more efficient than θ̂₂ when se[θ̂₁] ≤ se[θ̂₂]. For instance, both the sample mean and the sample median can be used as estimators of μ for an N(μ, σ²) distribution, but the mean is more efficient. We have found this in the simulation of Example 2.1, but a mathematical proof is more difficult. In many cases maximum efficiency is sought among linear estimators, leading to the concept of best linear unbiased estimators (BLUE). Best means here minimum variance.
Another approach to the assessment of an estimator is based on its convergence as the sample size tends to infinity. In this line, a common requirement for an estimator is consistency. Let me use the notation θ̂ₙ, to emphasize the dependence of the estimator on the sample size n. We say that the sequence θ̂ₙ is consistent when plim θ̂ₙ = θ. Based on the Chebyshev inequality (section 1), it can be shown that lim se[θ̂ₙ] = 0 implies consistency. Thus, those estimators whose variance has an n in the denominator, such as the sample mean and variance, are consistent.
Another desirable property is asymptotic normality. A sequence of estimators θ̂ₙ is asymptotically normal when the CDF of (θ̂ₙ − E[θ̂ₙ])/sd[θ̂ₙ] converges to the standard normal CDF as n → ∞. This means, in practice, that certain estimators are taken, for big samples, as if they were normally distributed. For instance, owing to the central limit theorem, the sample mean is asymptotically normal. Also, the maximum likelihood estimation method, which we leave for the econometrics course, produces asymptotically normal estimators in many situations, making the inference from the estimates much simpler.
All these definitions can be extended to estimators of a multidimensional parameter without pain (assuming that you are familiar with matrix and vector formulas). If we take a random vector θ̂ as an estimator of a parameter vector θ, the bias is a vector and the MSE a matrix, given by

MSE[θ̂] = E[(θ̂ − θ)(θ̂ − θ)ᵀ] = B[θ̂] B[θ̂]ᵀ + cov[θ̂].

Also, the variance is replaced by the covariance matrix in the efficiency comparisons: θ̂₁ is more efficient than θ̂₂ when cov[θ̂₂] − cov[θ̂₁] is positive semidefinite. Although the definitions are so easily extended, handling efficiency becomes a bit involved. I leave this here.
Z = (X̄ − μ)/(σ/√n) ~ N(0, 1).
Figure 2.3. Student t and N(0, 1) density curves
T = (X̄ − μ)/(S/√n).
The distribution is no longer the standard normal, but a different distribution, called the Student t distribution. It is a symmetric distribution, with zero mean and a bell-shaped density curve, similar to the N(0, 1) density (Figure 2.3). Like the normal, the Student t is not a single distribution, but a distributional model, in which an individual distribution is specified by a parameter which, as in the χ², is a positive integer called the number of degrees of freedom (df).
The formula for the Student t(n) PDF (n is here the number of degrees of freedom) is

f(x) = [Γ((n + 1)/2) / (Γ(n/2) (nπ)^(1/2))] (1 + x²/n)^(−(n+1)/2).

As given here, this formula still makes sense when n is not an integer. Nonintegers can be used in certain nonstandard tests.
For an alternative definition, take two independent variables X and Y; the Student t is then produced as

X ~ N(0, 1), Y ~ χ²(n)  ⟹  X/√(Y/n) ~ t(n).
Because of the symmetry with respect to zero, the mean and the skewness of a t distribution are null (the skewness converges only for n > 3). For n > 2, the variance is n/(n − 2) (infinite for n ≤ 2) and, for n > 4, the excess kurtosis is 6/(n − 4). Since the tails are fat for a low n, the Student t is sometimes used in finance to replace the normal as a model for the distribution of the returns of a security index.
We denote by tα (by tα(n) if there is ambiguity) the critical values of the Student t. This means that, if X has a t distribution, then p(X > tα) = α. The Student t converges (in distribution) to the standard normal as the number of degrees of freedom tends to infinity. In practice, this means that, denoting by Fₙ the CDF of the t(n) distribution, we have, for x ∈ ℝ and 0 < α < 1,

lim Fₙ(x) = Φ(x),   lim tα(n) = zα,

with both limits taken as n → ∞.
This formula gives limits for μ, called the 95% confidence limits for the mean. If X is not normally distributed but n is high (in many cases n > 25 suffices), this formula gives an approximation which, in general, is taken as acceptable. Replacing 1.96 by an adequate critical value zα, we can switch from the 95% to our probability of choice. Thus, the formula

x̄ ± zα σ/√n

gives the limits for a confidence level 1 − 2α. If the confidence level is not specified, it is understood that it is 95% (α = 0.025). With the confidence limits, we can compare the sample mean x̄ to a reference value μ₀. If μ₀ falls out of the limits, we conclude, with the corresponding confidence level, that μ ≠ μ₀. We say then that the difference x̄ − μ₀ is significant.
With real data, σ is unknown but, for a big n, it can be replaced by s, obtaining an approximate formula for the confidence limits of the mean. Nevertheless, there is an exact formula, appropriate for a small n (the difference between both formulas becomes irrelevant for big samples), in which zα is replaced by tα(n − 1). The formula is then

x̄ ± tα(n − 1) s/√n.

All this was said assuming a normal distribution. If this assumption is not valid, the formula of the confidence limits is still approximately valid for big samples, by virtue of the central limit theorem. In such a case, using either zα or tα does not matter, since they will be close. An application of these ideas is the formula of the confidence limits for a proportion, discussed in the next paragraph.
Example 2.2. Over the last decade, several large-scale cross-cultural studies have focused on well-
being in a wide range of nations and cultures, but, in general, Latin countries have only been
sporadically represented in these studies. In a recent study, the influence of gender, marital status
and country citizenship on different aspects of well-being has been examined, testing the uniformity
within the Latin world and comparing the variance due to the country effect with those due to
the gender and marital status effects. The data were collected on a sample of managers following
a part-time MBA program at business schools from nine Latin countries. I use here data on job
satisfaction (average of a 12-item Likert scale) from three countries, Chile (CH), Mexico (ME) and
Spain (SP).
Figure 2.4. Histogram and normal probability plot for the Chile group (Example 2.2)
After removing the cases with missing values (listwise), the sample size is n = 423, and the group
sizes n0 = 121, n1 = 111 and n2 = 191, for Chile, Mexico and Spain, respectively. The group
statistics are reported in Table 2.1. In Stata, these statistics are obtained with the command
tabstat.
TABLE 2.1. Group statistics (Example 2.2)
We calculate next 95% confidence limits for the mean of the Chile subpopulation. Based on t₀.₀₂₅(120) = 1.980, we get 4.158 ± 0.162. Of course, for such a sample size, using the t or the N(0, 1) critical value does not matter. Also, we can leave aside the concern about the normality of the distribution, although the diagnostic plots of Figure 2.4 show that normality is questionable here. Note that, since we are dealing with the Chile mean, the point is whether we assume or not the normality of the distribution in the Chile subpopulation. This is not a trivial issue, because we don't make any assumption on the global three-country distribution. It is not clear what such a distribution would be, not only because the means may be different, but because we are sampling from the three countries in proportions that have no meaning in population terms.

In Stata, the limits can be obtained directly, using invttail(120,0.025), but there is a CI calculator that performs these calculations without bothering the user with critical values (command ci).
Source: S Poelmans & MA Canela, Statistical analysis of the results of a nine-country study
of Latin managers, XIth European Congress on Work and Organizational Psychology (Lisboa,
2003).
Survey designers call the summand on the right the sampling error (that is how they report the error). The sampling error is an assessment of the magnitude of the error that we can get in extrapolating to the population the estimate derived from a random sample (be careful here, randomness is not granted in many surveys). If nothing is said to the contrary, z = 1.96 is used.

Sometimes, the sampling error is calculated prior to the survey, when p is still unavailable. If an initial estimate (or guess) is available, it is used in the formula. If not, we use p = 0.5, which is the worst possible case, since the maximum value of p(1 − p) is attained for p = 0.5 (check this!). In this case, rounding z₀.₀₂₅ = 1.96 to 2, the sampling error can be approximated by 1/√n. This provides practitioners with a rule of thumb: for n = 100, the sampling error is 10%, for n = 400, it is 5%, etc. This explains the sizes used in the surveys that are currently reported in the media, in which going far beyond n = 1600 would not compensate the increase in the cost of the survey.
Example 2.3. A survey is planned on the consumption of soft drugs by boys/girls of ages from 15
to 20 years in a certain population. Assuming simple random sampling from a big population and
a percentage of consumption of about 20%, how big must the sample be for ensuring, with a 95%
confidence, that the error in the percentage estimated is lower than 5%?
The sample size n must be such that the sampling error corresponding to a 95% confidence level is less than 5%. Taking p = 0.20, this means that

1.96 √(0.2 × 0.8 / n) < 0.05  ⟹  n > 1.96² × 0.2 × 0.8 / 0.05² = 245.86.

Therefore, the sample size must be at least 246. If the initial 20% estimate were not available, we would use 50%. This would lead us to

n > 1.96² × 0.5 × 0.5 / 0.05² = 384.16.
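The computation is a one-liner; a Python sketch of the sample-size formula (z = 1.96 for a 95% confidence level):

```python
import math

def sample_size(p, error, z=1.96):
    """Minimum n so that z * sqrt(p (1 - p) / n) stays below the target sampling error."""
    return math.ceil(z**2 * p * (1 - p) / error**2)

n_with_guess = sample_size(0.20, 0.05)  # 246, using the initial 20% estimate
n_worst_case = sample_size(0.50, 0.05)  # 385, the conservative p = 0.5 choice
```

The worst-case size is what pollsters quote when no prior estimate of p is available.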
2.7. Homework
A. Suppose that a random variable X can take the values 1, 2, 3, 4, 5, with probabilities

f(1) = θ³,  f(2) = θ²(1 − θ),  f(3) = 2θ(1 − θ),  f(4) = θ(1 − θ)²,  f(5) = (1 − θ)³.
3. Testing means and variances
3.1. An example
The logic of hypothesis testing is not obvious for the beginner. Only with some experience in testing does one realize the efficiency that results from using the same argument in many different situations. Thus, I skip the theoretical discussion, although something else will be said in section 5. So, instead of a formal definition, we start with an example.
Example 3.1. The data set for this example includes data on wages in years 1980 and 1987. The
sample size is n = 545. The variables are: (a) NR, an identifier, (b) LWAGE0, wages in 1980, in
thousands of US dollars and (c) LWAGE7, the same for 1987. The wages come in log scale, which
helps to correct the skewness. Do these data support that there has been a change in the wages?
Note that we don't care about individual changes, but about the average change.
To examine this question, we introduce a new variable X, corresponding to the difference between these two years (1987 minus 1980). We search for evidence that the mean of X is different from zero. Our first analysis is based on a confidence interval. The basic information is

x̄ = 0.473,  s = 0.606,  n = 545.

Given the sample size, we shouldn't worry about the skewness but, anyway, the histograms of Figure 3.1 support the use of the log scale. With the appropriate t factor (t₀.₀₂₅(544) = 1.964), we get the 95% confidence limits 0.422 and 0.524. Since this interval does not contain zero, we can conclude (95% confidence) that there has been a change. We say then that the mean difference 0.473 is significant or, more specifically, that it is significantly different from zero.
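The interval is easy to reproduce from the summary statistics; a Python sketch, using the normal critical value 1.96 in place of t₀.₀₂₅(544) = 1.964 (a negligible difference at this sample size):

```python
import math

# Summary statistics of the wage differences (Example 3.1)
xbar, s, n = 0.473, 0.606, 545

se = s / math.sqrt(n)  # standard error of the mean
z = 1.96               # normal critical value, close to t_0.025(544) = 1.964
lower, upper = xbar - z * se, xbar + z * se
```

Since the interval stays well above zero, the mean difference is significant at the 95% level.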
Figure 3.1. Histograms of the wages, in the original scale (left) and in the log scale (right) (Example 3.1)
Figure 3.2. Distribution of job satisfaction in Chile and Mexico (Example 2.2)
which has a N(0, 1) distribution under the null, can be used if σ is known. Note that, in Example 3.1, replacing t(n − 1) by N(0, 1) means changing the critical value 1.964 by 1.960. Without normality, the same methods are approximately valid for big samples. It is generally agreed that n ≥ 50 is big enough for that.
Z = (X̄₁ − X̄₂) / (σ √(1/n₁ + 1/n₂)),

with P-values provided by the N(0, 1) distribution. If σ is unknown, we replace it by the pooled variance

S² = ((n₁ − 1)S₁² + (n₂ − 1)S₂²) / (n₁ + n₂ − 2),
Figure 3.3. Scatterplot of LWAGE7 versus LWAGE0 (Example 3.1)
T = (X̄₁ − X̄₂) / (S √(1/n₁ + 1/n₂)),
whose P-values are given by the t(n₁ + n₂ − 2) distribution. In a second variant of the test, it is not assumed that σ₁ = σ₂ (nor that they are different). We use then

T = (X̄₁ − X̄₂) / √(S₁²/n₁ + S₂²/n₂),   df = (S₁²/n₁ + S₂²/n₂)² / [ (S₁²/n₁)²/(n₁ − 1) + (S₂²/n₂)²/(n₂ − 1) ].

Under the null, the distribution of T can be approximated by a Student t, with the degrees of freedom given by the second formula. This is called the Satterthwaite approximation. The number of degrees of freedom is rounded when the test is done by hand.
Example 2.2 (continuation). To illustrate the two-sample t test, I apply it to the mean difference between Chile and Mexico in Example 2.2. I use first the equal-variances version. We have n₁ = 121, x̄₁ = 4.158, s₁ = 0.902 for Chile and n₂ = 111, x̄₂ = 4.413, s₂ = 0.865 for Mexico (Table 2.1). Then,

s = √((120 × 0.902² + 110 × 0.865²) / 230) = 0.884,   t = (4.158 − 4.413) / (0.884 √(1/121 + 1/111)) = −2.196.

Therefore, P = 0.029. This can be obtained, in Stata, with 2*ttail(230,2.196). The t statistic and the P-value are directly obtained with the ttest command, whose default is the equal-variances option. With the option unequal we get the alternative version, which gives t = −2.200 (df = 230, P = 0.029). As expected, the differences between the two versions of the test are irrelevant.
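The equal-variances computation can be checked in a few lines of Python, using the group summaries from Table 2.1:

```python
import math

# Group summaries (Chile, Mexico) from Table 2.1
n1, m1, s1 = 121, 4.158, 0.902
n2, m2, s2 = 111, 4.413, 0.865

# Pooled standard deviation and equal-variances two-sample t statistic
df = n1 + n2 - 2
s = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
t = (m1 - m2) / (s * math.sqrt(1 / n1 + 1 / n2))
```

The pooled s weights each group variance by its degrees of freedom, so the larger group dominates.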
Example 3.1 (continuation). The analysis of section 3.1 would usually be run as a paired-data t test. What if we do the wrong thing, applying a two-sample test? We get then t = 15.19 (df = 1088, P < 0.001). So, the two tests lead to the same conclusion. Is this true in general? The answer is no. In this case, in spite of the results being similar, the two-sample test can easily be shown to be wrong, since the two variables are positively correlated (r = 0.310), which makes sense, since most of the people with higher wages still get high wages seven years later. We will see in chapter 10 how to test the null ρ = 0. For the time being, we satisfy ourselves by illustrating this question with Figure 3.3.
Some popular tests, such as those used in the analysis of variance and linear regression, are based on the ratio of two sample variances. They are based on a well-known model for the distribution of these ratios. For two independent samples with distributions N(μ₁, σ²) and N(μ₂, σ²), the ratio of the sample variances,

F = S₁²/S₂²,
follows an F distribution. The general formula for the PDF of an F distribution with (n₁, n₂) degrees of freedom, briefly F(n₁, n₂), is

f(x) = [Γ((n₁ + n₂)/2) / (Γ(n₁/2) Γ(n₂/2))] n₁^(n₁/2) n₂^(n₂/2) x^(n₁/2 − 1) / (n₂ + n₁x)^((n₁+n₂)/2),   x > 0.

An example is shown in Figure 3.4. The first factor is a normalization constant. n₁ and n₂ are the parameters of this model. When it is used as a model for a ratio of two sample variances, n₁ is associated to the numerator and n₂ to the denominator.
We denote by F the critical value
associated to right tail. More explicitly, p F > F = . In Stata, calculations related to the F
distribution can be managed through the functions Ftail, invFtail and upperFtail.
The t, χ² and F distributions are presented in textbooks as the distributions derived
from the normal. The F distribution can be related to the other two in two ways:

If X1 ~ χ²(n1) and X2 ~ χ²(n2) are independent, then (X1/n1) / (X2/n2) ~ F(n1, n2).

If X ~ t(n), then X² ~ F(1, n).

It is important to keep in mind this second property, which implies that any t test can be seen as
an F test. The only thing lost in taking squares is the sign. A consequence of this relationship
is that

|t(n)| > c   ⟺   F(1, n) > c²

or, equivalently, t_{α/2}(n)² = F_α(1, n).
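The relation between t(n)² and F(1, n) can be checked by simulation. The Python sketch below (a hypothetical setup, not from the text) builds t(10) draws from their definition as a standard normal divided by the square root of a scaled χ²(10), squares them, and compares the average with the mean of the F(1, 10) distribution, which is n/(n − 2) = 1.25.

```python
import random
from math import sqrt

random.seed(1)
n = 10           # degrees of freedom of the t distribution
reps = 100_000

def t_draw(n):
    """One draw from t(n), built as Z / sqrt(chi2(n)/n)."""
    z = random.gauss(0, 1)
    chi2 = sum(random.gauss(0, 1)**2 for _ in range(n))
    return z / sqrt(chi2 / n)

# Squares of t(n) draws should behave as F(1, n) draws;
# the mean of F(1, n) is n/(n - 2) = 1.25 for n = 10
m = sum(t_draw(n)**2 for _ in range(reps)) / reps
print(m)   # close to 1.25
```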
Note that the numerator of this fraction is a sum of squares related to between-group variation,
whereas the denominator is related to within-group variation. The number of summands is the
same, n. The advantage of this expression is that it can be easily generalized to the case of k
samples, leading to the one-way ANOVA F test. Suppose now k independent samples, of sizes
n1 , . . . , nk , and let n = n1 + + nk be the total sample size. The data can be arranged as in
Table 3.1, where each group is a column (the columns can have different lengths) and the last row
contains the group means.
The denominator of the fraction in the above formula is equal to the sum of squares within sample
1, plus the sum of squares within sample 2. To generalize this, we use the within-group sum of
squares

SSW = Σ_{j=1}^{n1} (x1j − x̄1)² + ⋯ + Σ_{j=1}^{nk} (xkj − x̄k)² = (n − k) s².
The numerator can be regarded as a sum of squares, some of them repeated, so that the number of
summands is the same as in the denominator. The generalization to k groups is straightforward.
Presenting the data as in Table 3.1 helps to understand what these sums of squares are. We can
consider two different sources of variability in this table. Vertically, we see the variability within
the groups, measured by SSW. Horizontally, in the means of the last row, we see the variability
between the groups, measured by SSB.
TABLE 3.1. Data for a one-way ANOVA test

x11   x21   ⋯   xk1
 ⋮     ⋮         ⋮
x̄1    x̄2    ⋯   x̄k
Note that, for k = 2, the factor k − 1 can be omitted, leading to the formula given above. Under
the null, this F statistic has an F(k − 1, n − k) distribution, which can be used to calculate a
P-value.
With (2, 429) degrees of freedom, this leads to P = 0.029 and, therefore, to the rejection of the
null. We can conclude, then, that there are differences between countries.
A number of degrees of freedom is assigned to each sum of squares. Roughly speaking, it is the
number of independent terms in the sum. Since ȳ is the mean of all the observations yij, SST has
n − 1 degrees of freedom. In SSB, there are k different terms, but the k deviations ȳi − ȳ,
weighted by the group sizes, sum to zero, so SSB has k − 1 degrees of freedom, and SSW is left
with n − k:

Source    Sum of squares    df
Between   SSB               k − 1
Within    SSW               n − k
Total     SST               n − 1
Next, a mean square (MS) is calculated, dividing the sums of squares by their respective numbers
of degrees of freedom. The F statistic is the ratio

F = MSB / MSW.
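As an illustration, the following Python sketch computes SSB, SSW, the mean squares and the F statistic for a small hypothetical data set with k = 3 groups. It is not meant to replace Stata's oneway command, just to make the formulas concrete.

```python
# Hypothetical data: three groups (k = 3, n = 9)
groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n

# Between-group and within-group sums of squares
ssb = sum(len(g) * (sum(g) / len(g) - grand_mean)**2 for g in groups)
ssw = sum((x - sum(g) / len(g))**2 for g in groups for x in g)

msb = ssb / (k - 1)   # between-group mean square
msw = ssw / (n - k)   # within-group mean square (the pooled variance)
f = msb / msw
print(ssb, ssw, f)    # 42.0 6.0 21.0
```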
Note that the within-group mean square MSW is equal to the pooled variance s². In one-way
ANOVA, the conditions for the validity of the F test are the same as in the two-sample t test. The
data set is partitioned into k groups, which are assumed to be independent samples of N(μi, σ²)
distributions, i = 1, …, k. Whether these assumptions are acceptable is usually checked through
the residuals. In one-way ANOVA, the residuals are the deviations with respect to the group
means, eij = xij − x̄i. If the one-way ANOVA assumptions were valid, the residuals should look
like a random sample of the N(0, σ²) distribution, and this is what we check in practice. The
assumption that the variance is the same for all samples is called homoskedasticity. This topic will
be discussed in depth in the econometrics course.
TABLE 3.3. ANOVA table (Example 2.2)
Example 2.2 (continuation). The ANOVA table corresponding to Example 2.2 (Table 3.3) can be
obtained in Stata with the command oneway, but it does not provide the residuals. The command
anova is much more powerful, since it allows for more complex forms of analysis of variance. Note
that, in anova, the variable that defines the groups must be coded as a numerical variable. If it is
not so in your data set, it can easily be changed with the command encode. Also, anova allows
postestimation commands.
A postestimation command is one that uses the results of the last estimation command to
produce additional results, not included in the initial output. Probably, predict is the most
widely used of these commands. The option residuals produces here a new variable containing
the ANOVA residuals. You can see the histogram and the normal probability plot in Figure 3.5.
The skewness is 0.5. So far, the normality assumption is not clear at all, but we wouldn't worry
about the validity of the conclusion, since, with such sample sizes, the F test is safe enough.
[Figure 3.5: histogram and normal probability plot of the ANOVA residuals]
3.8. Homework
A. Reanalyze the data of Example 3.1 after putting the wages back in the original scale (no
logs). How does the interpretation of the mean difference change?
B. Draw 250 independent random samples of size 5 from N (0, 1) and calculate the sample
variance for each sample. The same for N (1, 1). Divide the first by the second, getting 250
F statistics and plot a histogram. Compare this histogram with Figure 3.4.
C. The data set for this exercise comes from the same study as Example 7.2, but includes data
from nine countries. Test the differences among countries using the methods of this chapter.
Taking SSR as a function S(b0, b1) of the coefficients, this is an unrestricted optimization problem
with a quadratic objective function, which is easy to solve through differential calculus. The
gradient vector and the Hessian matrix are, respectively,

∇S(b0, b1) = −2 [ n(ȳ − b0 − b1 x̄)
                  Σ xi yi − n b0 x̄ − b1 Σ xi² ],

∇²S(b0, b1) = 2 [ n     n x̄
                  n x̄   Σ xi² ].
The line of equation y = b0 + b1 x is called the regression line (of y on x), and b0 and b1 are the
regression coefficients: b1 is the slope and b0 the intercept or constant. It follows from the formula
of the intercept that ȳ = b0 + b1 x̄, meaning that the regression line crosses the average point (x̄, ȳ).
This is equivalent to the sum (and the mean) of the residuals being equal to zero. A third way of
saying the same thing is writing the equation of the regression line as y − ȳ = b1 (x − x̄).
which can be written, in short, as SST = SSE + SSR. Here, SSE is the sum of squares explained by
the regression. We rewrite the ANOVA decomposition as

R² = SSE / SST,   1 − R² = SSR / SST.

R² is called the R-squared statistic (also the coefficient of determination), and it is also referred
to as the percentage of the variation explained by the regression. To understand how to use this
decomposition, look at the extreme cases. If R² = 0, then b1 = 0, the regression line is the
horizontal line of equation y = ȳ, and the residuals are the deviations with respect to the mean,
ei = yi − ȳ. There is no fit at all. When R² = 1, all the residuals are null, meaning that the points
(xi, yi) are aligned and the fit is perfect. These two extremes are never found with real data and,
in practice, we take the proximity of R² to 1 as an indication of good fit.
[Figure 4.1: scatterplot of SALARY on MARKET, with the regression line]
This tells us that the sample correlation is a standardized regression slope. If x and y have unit
variance, the slope is equal to the correlation. Also, with a bit of algebra, it can be proved that R²
is the square of the correlation, justifying the use of the letter R.
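The identity between R² and the squared correlation can be verified numerically. Below is a short Python sketch with hypothetical data, fitting the regression line from the moment formulas and comparing R² with r².

```python
from math import sqrt
from statistics import mean

# Small hypothetical data set
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]
xbar, ybar = mean(x), mean(y)

# Regression coefficients from the moment formulas
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar)**2 for a in x)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# ANOVA decomposition: R^2 = 1 - SSR/SST
sst = sum((b - ybar)**2 for b in y)
ssr = sum((b - b0 - b1 * a)**2 for a, b in zip(x, y))
r2 = 1 - ssr / sst

# Sample correlation; its square reproduces R^2
r = sxy / sqrt(sxx * sst)
print(round(b1, 2), round(b0, 2), round(r2, 4), round(r**2, 4))
```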
Some prefer to write the equation of the regression line as y − ȳ = b1 (x − x̄), avoiding the intercept.
Finally, writing the equation of the regression line as

(y − ȳ) / sy = r (x − x̄) / sx,

we see that, if both variables are standardized, the slope coincides with the correlation and the
intercept is zero. The former is no longer true in multiple regression, where standardized regression
coefficients are not correlations, although, in most cases, they look as if they were.
Example 4.1. The data for this example come from a study on the salaries of academic staff
in Bowling Green State University. This data set has been used in several textbooks and can be
considered as a standard example. The sample size is 514. We use here the variables: (a) SALARY,
academic year (9 month) salary in US dollars, and (b) MARKET, marketability of the discipline,
defined as the ratio of national average salary paid in the discipline to the national average across
disciplines.
It is natural here to take MARKET as the independent variable, fitting a regression line to the
data to produce an equation for predicting the salary from the marketability. Denoting them as x
and y, respectively, we get
hence

b1 = 0.407 × (12,672.7 / 0.149) = 34,545.2,   b0 = 50,863.87 − 34,545.2 × 0.948 = 18,097.0.

We have thus obtained the equation of the line for the regression of SALARY on MARKET,

SALARY = 18,097.0 + 34,545.2 × MARKET.

We can see in Figure 4.1 a scatterplot of these data, with the regression line superimposed.
Here, β0, β1 and σ² are the parameters of the model, on which the inference is to be done.
As a regression equation, we assume that

Y = β0 + β1 X + ε,   ε ~ N(0, σ²),   E[ε|X] = 0.

You can easily derive the first formulation from the second one by fixing X and taking expectation
and variance, and the second from the first one by putting ε = Y − E[Y|X]. But let me insist on
the main assumptions, since their failure provides much of the motivation for alternative methods
in the econometrics course:
Linearity. Although most analysts apply linear regression methods taking linearity for
granted, nonlinearity issues must be taken into account. In many cases, they are easily
fixed.
Uncorrelated error term. E[ε|X] = 0. It can be proven, using the properties of the
conditional expectation, that this implies that X and ε are uncorrelated, although the two
conditions are not the same. Nevertheless, under joint normality, this property, uncorrelatedness
and independence are the same.

Homoskedasticity. The fact that the error variance is constant receives this picturesque
name. When this assumption fails, we say that there is heteroskedasticity. Methods
for dealing with heteroskedasticity will be introduced in the econometrics course.
4.5. Estimation
It follows easily from the assumptions of the linear regression model that

β1 = cov[X, Y] / var[X],   β0 = E[Y] − β1 E[X].

We can obtain moment estimators for these parameters by replacing means, variances and covariance
by their sample versions,

b1 = sxy / sx²,   b0 = ȳ − b1 x̄.
The square root s of the error variance estimate s² = SSR/(n − 2) is sometimes reported as the
standard error of the regression. These estimators are unbiased (for s² to be unbiased, the
denominator must be n − 2). With a bit of algebra, we get

var[b0] = σ² Σ xi² / (n Σ (xi − x̄)²),   var[b1] = σ² / Σ (xi − x̄)².
Estimates of the standard errors are obtained by replacing 2 by s2 . Inference (confidence limits
and testing) on the regression coefficients is based on these estimated standard errors. Mind that
the variances are conditional to the values of X that we have in our data set. This is evident
from the fact that these values are involved in the formulas. The variance of b1 can be reduced
by augmenting the variation of X. This is used in experimental design to improve the precision of
the slope estimates.
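To make the standard error formulas concrete, here is a Python sketch that fits a regression line to a small hypothetical data set and computes se[b0] and se[b1] by direct application of the formulas above, replacing σ² by s² = SSR/(n − 2).

```python
from math import sqrt
from statistics import mean

# Small hypothetical data set
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]
n = len(x)
xbar, ybar = mean(x), mean(y)

# Coefficient estimates
sxx = sum((a - xbar)**2 for a in x)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Unbiased error variance: s^2 = SSR / (n - 2)
ssr = sum((b - b0 - b1 * a)**2 for a, b in zip(x, y))
s2 = ssr / (n - 2)

# Estimated standard errors, replacing sigma^2 by s^2 in the formulas
se_b1 = sqrt(s2 / sxx)
se_b0 = sqrt(s2 * sum(a**2 for a in x) / (n * sxx))
print(round(se_b1, 4), round(se_b0, 4))   # 0.1414 0.3873
```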
4.6. Testing
If the error term is normally distributed, we can calculate 1 − α confidence limits for the regression
coefficients, putting b ± t_{α/2}(n − 2) se[b]. Standard errors can also be used to run a t test. For the
null H0: βi = 0, we use

t = bi / se[bi],

with df = n − 2. In the computer, the standard regression output contains, besides every parameter
estimate, the standard error, the t statistic and the P-value (Table 4.1). The report also includes
the ANOVA decomposition of paragraph 4.2. It can also be seen that the F statistic
F = SSE / (SSR/(n − 2)) = (n − 2) R² / (1 − R²)

is the square of the t statistic associated with β1, so that it is redundant here (this is no longer true
in multiple regression). The second formula, involving R², shows that the slope is significant
when R² is close enough to 1. We say that R² is significant, or that the correlation is
significant, when the F statistic is so. It is also obvious, from the formula, that weak correlations
can be significant with big enough samples.
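With the figures reported below for Example 4.1 (R² = 0.166, n = 514, t = 10.09 for the slope), this relation is easy to check by arithmetic; the small discrepancy comes from rounding.

```python
# Figures reported for Example 4.1
n, r2, t = 514, 0.166, 10.09

# F = (n - 2) R^2 / (1 - R^2) should equal the square of the slope's t
f = (n - 2) * r2 / (1 - r2)
print(round(f, 1), round(t**2, 1))   # equal up to rounding
```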
A noteworthy detail about the coefficient estimators b0 and b1 is that they are correlated. This
means, in practice, that it is not correct to make inferences about both coefficients separately, as
we do with the mean and the variance (which are uncorrelated). We skip the formula here, because
the right setting for the discussion is that of the general regression model, where we find a formula
for obtaining the covariance matrix of the regression coefficient estimators at once. Nevertheless,
this issue is lightly touched in the homework (exercise E).
The analysis of the residuals is useful for checking the validity of the model. This analysis is
similar to that of the residuals of the one-way ANOVA, plus the possibility of a residual plot,
where we place the residuals (standardized or not) on the ordinates and, on the abscissas, the xi,
the predicted values ŷi = b0 + b1 xi, or the order in which the data were obtained.
What can be done without the normality assumption? Practically the same, for big samples.
Indeed, it can be shown that the OLS estimators of the regression coefficients are asymptotically
normally distributed, so that most of the previous discussion remains valid.
[Figure: residual plot for the regression of Example 4.1]
Example 4.1 (continuation). In Stata, we obtain a standard regression report with the command
regress. This report contains a table with the coefficient estimates (Table 4.1), an ANOVA table
and some additional results. In the coefficients table, we find the coefficient estimates, the standard
errors, the t statistics and the P-values.
TABLE 4.1. Linear regression results (Example 4.1)
Apart from the ANOVA table, we find F(1, 512) = 101.8. This is the square of the t statistic
for MARKET (t = 10.09), with the same P-value. R² = 0.166 appears below. The adjusted R²
statistic, used in multiple linear regression to compare models with different numbers of independent
variables, is irrelevant here.
The covariance matrix of the coefficient estimators is not reported, but Stata saves it as e(V). This
means that, naming this matrix, we can use it in any calculation. In this example, we get (the
order is the same as in Table 4.1)

cov[b] = [  11,726,058
           −11,122,417   10,811,002 ].
4.7. Homework
A. Because of the skewed distribution of variables such as salaries, sales or size, econometricians
5. More on testing
Example 5.1. We use simulation to calculate the power of a test of the null μ = 0 for a normal
with known σ = 1, based on the statistic

Z = X̄ / (S/√n).

We assume μ = 0.5 and perform the simulation for n = 5 and for n = 10. We draw first 10,000
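A simulation along these lines can be sketched in Python; the two-sided 5% critical value 1.96 is an assumption of this sketch, and the details of the text's setup may differ.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)

def power(n, mu=0.5, reps=10_000, crit=1.96):
    """Estimated power: fraction of samples with |Z| above the critical value."""
    hits = 0
    for _ in range(reps):
        sample = [random.gauss(mu, 1) for _ in range(n)]
        z = mean(sample) / (stdev(sample) / sqrt(n))
        hits += abs(z) > crit
    return hits / reps

p5, p10 = power(5), power(10)
print(p5, p10)   # the power grows with the sample size
```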
In Table 5.1, nij refers to the joint occurrence Ai ∩ Bj, the row total ni+ to the occurrence of
Ai, and the column total n+j to that of Bj. The proportions are pij = nij/n, pi+ = ni+/n and
p+j = n+j/n.
The null hypothesis of an independence test is that the two categorical variables, rows and columns,
are statistically independent. This means that the product formula πij = πi+ π+j holds for every
cell. In the test, we take pij as the observed proportion and p̂ij = pi+ p+j as the expected
proportion.
Example 5.2. Table 5.2 shows fictional data on the purchases of three products, A, B and C. The
sample is partitioned by age: young adults (18-35), middle-aged (36-55) and senior (56 and older).
TABLE 5.2. Contingency table (frequencies)
Group A B C Total
Young adults 20 20 20 60
Middle age 40 10 40 90
Senior 20 10 40 70
Table 5.3 is the corresponding table of proportions, obtained by dividing each frequency by the
total sample size. Expected proportions have been included in the table, in parentheses.
TABLE 5.3. Contingency table (proportions)
Group A B C Total
Table 5.4 contains the residuals for the proportions of Table 5.3. The χ² statistic is

X² = 220 [(−0.026)² + (0.186)² + ⋯] = 17.64   (df = 4, P = 0.001).
Group A B C
The same test can also be used to test the homogeneity of a probability distribution across
populations, that is, a null H0: π1j = ⋯ = πrj, for j = 1, …, c, where the rows i = 1, …, r
correspond to the populations. This is called a homogeneity test. The data can be presented as
in Table 5.2, the only difference being that the row totals n1+, …, nr+, which correspond in this
case to the sizes of the samples drawn from the populations compared, are specified in the data
collection design.
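The whole calculation for Table 5.2 can be reproduced with a few lines of Python: the expected counts are computed from the row and column totals, and the result matches the X² = 17.64 above up to rounding (the text's value comes from summing rounded residuals).

```python
# Frequencies from Table 5.2 (rows: age groups; columns: products A, B, C)
table = [[20, 20, 20],
         [40, 10, 40],
         [20, 10, 40]]

row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]
n = sum(row_tot)

# X^2 = sum over cells of (observed - expected)^2 / expected
x2 = sum((obs - row_tot[i] * col_tot[j] / n)**2 / (row_tot[i] * col_tot[j] / n)
         for i, row in enumerate(table) for j, obs in enumerate(row))

df = (len(table) - 1) * (len(table[0]) - 1)
print(round(x2, 2), df)   # 17.63 4, matching the text up to rounding
```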
Under normality, JB is asymptotically χ²(2) distributed. Instead of the Jarque-Bera test, Stata
provides an alternative test, also based on the skewness and the kurtosis (command sktest).
Example 1.2 (continuation). The sample skewness and kurtosis of the Brazil returns are, respectively,

Sk = 0.193,   K = 0.842.

Although we thus obtain a nonsignificant z value for the skewness, z = 1.273 (P = 0.203), the
kurtosis is highly significant, z = 2.776 (P = 0.005). Therefore, we reject the normal distribution,
due to the significant positive kurtosis found in the data. The Jarque-Bera statistic, JB = 9.330
(P = 0.009), leads us to the same conclusion. Less sharp results are given by the Kolmogorov-
Smirnov (D = 0.039, P = 0.824) and Shapiro-Wilk tests (W = 0.9901, P = 0.0715).
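For reference, the standard Jarque-Bera formula is JB = n(Sk²/6 + K²/24), with K the excess kurtosis. Since the sample size of the Brazil series is not quoted here, the sketch below evaluates it on hypothetical inputs; it also exploits the fact that the χ²(2) survival function is simply exp(−x/2).

```python
from math import exp

def jarque_bera(n, skew, excess_kurt):
    """JB statistic and asymptotic P-value from skewness and excess kurtosis."""
    jb = n * (skew**2 / 6 + excess_kurt**2 / 24)
    # Under normality, JB ~ chi2(2), whose survival function is exp(-x/2)
    return jb, exp(-jb / 2)

# Hypothetical inputs, not the Brazil series of the example
jb, p = jarque_bera(200, 0.3, 0.5)
print(round(jb, 2), round(p, 3))   # 5.08 0.079
```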
Zero observations are discarded in this test. This is not relevant as long as the continuity
assumption, under which repetition is not expected, is tenable.
Example 3.1 (continuation). In Example 3.1, we find 443 cases in which the wages have been
increased. The (two-tail) P-value can be obtained in Stata with the binomialtail function,
which gives P < 0.001. This is consistent with the outcome of the t test. The asymptotic P-value
would be based on the N(272.5, 136.25) distribution. This test can be directly performed in Stata
with the command signtest.
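The asymptotic version of this sign test is easy to reproduce: a null mean of 272.5 implies 545 nonzero differences, so the count of increases is Bin(545, 0.5) under the null, with mean n/2 and variance n/4. A Python sketch:

```python
from math import sqrt
from statistics import NormalDist

# 443 increases; the null mean 272.5 implies n = 545 nonzero differences,
# so the count of increases is Bin(545, 0.5), approximately normal
n, count = 545, 443
mu, var = n / 2, n / 4

z = (count - mu) / sqrt(var)
p = 2 * NormalDist().cdf(-abs(z))
print(round(z, 1), p < 0.001)   # 14.6 True
```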
If there are ties in the absolute values, they get an average rank, and the variance must then be
corrected. The Stata command signrank provides a correction for this case. To get exact significance
levels for this test and those which follow, one should look at the corresponding tables or use a
special package. What we usually find in a generalist stat package is an asymptotic significance
level. The difference may be relevant for small-sample studies (e.g. in biostatistics), but not for the
sample sizes that we usually find in econometrics. In fact, the tables that we find in textbooks do
not go beyond n = 20. Asymptotic P-values are based on a normal approximation whose mean
and variance are given by the above formulas. Stata reports the z value associated with T+ (i.e.
subtracting the mean and dividing by the standard deviation given by the above formulas).
Example 3.1 (continuation). With the Stata command signrank, we get z = 15.9 (P < 0.001).
δ = μ1 − μ2 is sometimes called the treatment effect. The null is δ = 0. The test statistic W is
obtained as follows:

The two samples are merged, and the resulting sample (of size n1 + n2) is sorted.
We assign ranks to the observations, averaging ties.
W is the sum of the ranks of the first sample.

Under the null, W has a symmetric (discrete) distribution, with

μ = n1 (n1 + n2 + 1) / 2,   σ² = n1 n2 (n1 + n2 + 1) / 12.
As in the signed rank test, exact significance levels are usually extracted from tables, but only for
small samples. For n2 > 10, asymptotic levels are accepted.
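The W statistic and its asymptotic z value can be computed directly; below is a Python sketch with two tiny hypothetical samples (with real data one would use ranksum).

```python
from math import sqrt

def rank_sum(sample1, sample2):
    """Wilcoxon rank-sum W for sample1, with its asymptotic z value."""
    merged = sorted(sample1 + sample2)

    def rank(v):   # average rank, which handles ties
        lo = merged.index(v) + 1
        hi = lo + merged.count(v) - 1
        return (lo + hi) / 2

    w = sum(rank(v) for v in sample1)
    n1, n2 = len(sample1), len(sample2)
    mu = n1 * (n1 + n2 + 1) / 2
    var = n1 * n2 * (n1 + n2 + 1) / 12
    return w, (w - mu) / sqrt(var)

w, z = rank_sum([1, 2, 3], [4, 5, 6])
print(w, round(z, 2))   # 6.0 -1.96
```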
Example 2.2 (continuation). To compare Chile and Mexico as in section 3, we use the Stata
command ranksum, getting z = 2.224 (P = 0.026), similar to the t test.
where Ri is the sum of the ranks of sample i and n is the total sample size (n = n1 + ⋯ + nk).
For ni ≥ 5, the distribution of H can be approximated by a χ²(k − 1).
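Assuming no ties, the H statistic can be sketched as follows (the data are hypothetical; with real data one would use kwallis):

```python
def kruskal_h(groups):
    """Kruskal-Wallis H statistic (assuming no ties)."""
    merged = sorted(x for g in groups for x in g)
    n = len(merged)
    rank = {v: i + 1 for i, v in enumerate(merged)}
    # H = 12/(n(n+1)) * sum of R_i^2 / n_i - 3(n+1)
    h = sum(sum(rank[v] for v in g)**2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * h - 3 * (n + 1)

h = kruskal_h([[1, 2], [3, 4], [5, 6]])
print(round(h, 2))   # 4.57
```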
Example 2.2 (continuation). To compare the three countries, we use the command kwallis. This
gives χ²(2) = 9.46 (P = 0.009), more significant than in Table 3.3.
5.10. Homework
A. The results of Table 5.5, where companies have been classified according to their activity in
two sectors, Production and Services, come from a study on work-family conciliation. The
columns of the table correspond to the responses to the question:
Are all the managers in your company concerned with work-family balance?
Test the effect of the type of activity on the concern with work-family balance using a
chi-square test.