
A (Very) Brief Review of Statistical Inference

Breno Schmidt
These notes provide a brief introduction to statistical inference.* As will become evident, I have sacrificed formalism in favor of what I think will be a better understanding of this important concept.

*These notes are still incomplete and probably full of mistakes. Please send any comments and corrections to breno.schmidt@usc.edu.

1 Some preliminaries

What is a random variable? As the name suggests, a random variable is a variable that will take an unknown value. For instance, the number of heads in 100 tosses of a coin is a random variable. Call this quantity H. Note that, although H is not known for sure until the tosses are made, one can place some restrictions on the possible values it can take. Of course, H has to be an integer between 0 and 100. We know that for sure. We also know (or at least suspect) that if we repeat the 100-toss experiment one million times and average the outcomes H, we will most likely find a number very close to 50. Also, if the coin is fair, we would be very surprised if we found a value of H close to 0 or to 100.
Thus, there is some statistical model driving the random variable H. In fact, one can
actually calculate the probability that the random variable H will take each value from 0 to
100. This function is called the probability distribution function of the random variable H.
The famous Normal Distribution is an example of such a function.
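
As a quick illustration of this point, here is a minimal sketch of how one could compute these probabilities numerically, assuming Python with scipy is available (none of this is required for what follows):

    from scipy.stats import binom

    # H = number of heads in 100 tosses of a fair coin: H ~ Binomial(100, 0.5)
    n, p = 100, 0.5
    print(binom.pmf(50, n, p))   # P(H = 50), about 0.08
    print(binom.pmf(0, n, p))    # P(H = 0), about 8e-31: effectively never
    print(binom.mean(n, p))      # E(H) = 50, the long-run average of the experiment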

1.1 The Normal Distribution Function

Roughly speaking, the normal distribution says that the outcomes of a (continuous) random
variable X become more and more unlikely the further we depart from the mean of the
distribution.
The normal distribution is characterized by two parameters: its mean and its variance.
If you know these two parameters you know the entire distribution. In fact, given a normal distribution with a particular set of parameters (say, mean of 0 and variance of 1) you can calculate the probability function of any normal variable.
Suppose the random variable Z is normal with zero mean and variance 1. This variable
Z is said to follow a standard normal distribution. At the end of any stats book you will find
a table for this special random variable that permits the calculation of the quantity:
$$P(Z > z) = \alpha$$

or, in words, the probability that the outcome of the random variable Z will be greater than a given value z is $\alpha$. If you have $\alpha$, the table will give you the value of z, and vice-versa. For instance, if you want the probability that the random variable Z will end up being greater than 1.96, the table will tell you that this quantity is 2.5%.

Because this distribution is symmetric, only positive values of z are shown. Thus, the probability that Z will be less than $-1.96$ is also 2.5%.
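
If you prefer software to tables, the same numbers can be obtained directly. A minimal sketch, assuming Python with scipy:

    from scipy.stats import norm

    print(norm.sf(1.96))    # P(Z > 1.96), about 0.025
    print(norm.cdf(-1.96))  # P(Z < -1.96), the same by symmetry
    print(norm.isf(0.025))  # the z that leaves 2.5% in the upper tail, about 1.96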
Now, suppose you want the probability function of a random variable $Y \sim N(1, 4)$. For concreteness, suppose you want to find the critical value y such that the probability $P(Y > y)$ is 2.5%. To find this value we use the following important property of a normal distribution:

$$Y \sim N(\mu, \sigma^2) \implies \frac{Y - \mu}{\sigma} \sim N(0, 1)$$

where $\mu$ and $\sigma^2$ are the mean and variance, respectively. We may call $\frac{Y - \mu}{\sigma} = Z$ and use the table for a standard normal distribution. Let's do that for the example above:

$$\frac{Y - 1}{2} \sim N(0, 1) \implies P\left(\frac{Y - 1}{2} > z\right) = P(Y > 2z + 1) = \alpha$$

For $\alpha = 0.025$ we know that $z = 1.96$; it follows that the value y above is $2 \times 1.96 + 1 = 4.92$. Thus,

$$P(Y > 4.92) = 0.025$$
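
This calculation is easy to double-check numerically. A sketch assuming scipy (note that scipy parameterizes the normal by its standard deviation, here $\sigma = 2$):

    from scipy.stats import norm

    # Y ~ N(1, 4): scipy's scale parameter is the standard deviation, sqrt(4) = 2
    print(norm.isf(0.025, loc=1, scale=2))   # critical value y, about 4.92
    print(norm.sf(4.92, loc=1, scale=2))     # P(Y > 4.92), about 0.025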
Now we turn to the problem of inference.

2 The problem

Suppose we are interested in a random variable X (e.g. returns) with the following distribution:

$$X \sim N(\mu, \sigma^2)$$

where $\mu$ and $\sigma^2$ are the unknown population mean and variance. The problem can be stated as:

Given the random sample $\{x_1, x_2, ..., x_N\}$ of X, how can we estimate the unknown mean $\mu$ and the unknown variance $\sigma^2$? Moreover, how reliable are these estimates? In particular, if I have some prior idea about the value of the mean, how can I use the sample to test this (null) hypothesis?

3 The Estimators

The best estimators¹ for the mean and variance are usually called the sample mean $\hat{\mu}$ and the sample variance² $\hat{\sigma}^2$. These are defined as:

$$\hat{\mu} = \frac{\sum_{t=1}^{N} x_t}{N} \quad \text{(sample mean)}$$

$$\hat{\sigma}^2 = \frac{\sum_{t=1}^{N} (x_t - \hat{\mu})^2}{N - 1} \quad \text{(sample variance)}$$

¹We call estimators the functions that produce the estimates for each sample.

²By best estimators we mean the Maximum Likelihood Estimators. Roughly speaking, the estimates from these estimators are the most likely values of the mean and variance given the sample. The MLE for the variance is actually $\sum_{t=1}^{N} (x_t - \hat{\mu})^2 / N$. However, since this is a biased estimator, we slightly modify it as shown above. The reason we do this is that we lose one degree of freedom when estimating the mean in the formula for the sample variance.
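
In code, these estimators are one-liners. A minimal sketch assuming numpy; the values in x are made up purely for illustration:

    import numpy as np

    x = np.array([0.02, -0.01, 0.05, 0.03, -0.02])   # a made-up sample of returns
    N = len(x)

    mu_hat = x.sum() / N                              # sample mean
    sig2_hat = ((x - mu_hat) ** 2).sum() / (N - 1)    # sample variance (N-1 denominator)

    # numpy's built-ins agree; ddof=1 selects the N-1 correction
    assert np.isclose(mu_hat, x.mean())
    assert np.isclose(sig2_hat, x.var(ddof=1))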

3.1 Properties of the sample mean and sample variance

It is important to note that the estimators are random variables themselves. Roughly speaking, we don't know their values until we are given the sample. However, even before we observe the sample, we can be sure that these estimators are:
1. Unbiased

An estimator is called unbiased if its mean equals the true unknown population parameter. For instance, it can be (easily) shown that:

$$E(\hat{\mu}) = E(X) = \mu$$

and

$$E(\hat{\sigma}^2) = Var(X) = \sigma^2$$

Intuitively: if you are given a very large number of random samples and, for each of them, you calculate estimates of $\mu$ and $\sigma^2$, the average of these estimates will be equal to the true population parameter you are interested in estimating. Thus, on average, your estimates will be right!
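
This "average over many samples" intuition is easy to simulate. A minimal sketch assuming numpy, with arbitrarily chosen parameters $\mu = 1$, $\sigma^2 = 4$, and $N = 30$:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, N = 1.0, 2.0, 30            # true mean 1, true variance 4

    samples = rng.normal(mu, sigma, size=(100_000, N))   # many random samples
    mu_hats = samples.mean(axis=1)                       # one estimate per sample
    sig2_hats = samples.var(axis=1, ddof=1)

    # Averaging the estimates across samples recovers the true parameters
    print(mu_hats.mean())    # close to 1.0
    print(sig2_hats.mean())  # close to 4.0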
2. The sample mean is normally distributed with mean $\mu$ and variance $\frac{\sigma^2}{N}$:

$$\hat{\mu} \sim N\left(\mu, \frac{\sigma^2}{N}\right)$$

Why? Well, we know that the sum of independent normal random variables is also a normal random variable. Moreover, the sample mean is an unbiased estimator, so its expectation is $\mu$. As for the variance, note that:

$$var(\hat{\mu}) = E(\hat{\mu}^2) - E^2(\hat{\mu}) = E(\hat{\mu}^2) - \mu^2$$

but

$$E(\hat{\mu}^2) = E\left[\frac{1}{N^2}\left(\sum_{t=1}^{N} x_t\right)^2\right] = \frac{1}{N^2}\left[\sum_{t=1}^{N} E(x_t^2) + 2\sum_{t>j} E(x_t)E(x_j)\right] = \frac{E(X^2)}{N} + \frac{N-1}{N}\mu^2 = \frac{\sigma^2}{N} + \mu^2$$

where the cross terms factor because the draws are independent, so $E(x_t x_j) = E(x_t)E(x_j)$ for $t \neq j$, and the last equality follows from $var(X) = \sigma^2 = E(X^2) - E^2(X)$.

Thus, we can express the variance of $\hat{\mu}$ as

$$var(\hat{\mu}) = \frac{\sigma^2}{N} + \mu^2 - \mu^2 = \frac{\sigma^2}{N}$$

3. The (scaled) sample variance is a Chi-Square random variable with $N - 1$ degrees of freedom:

$$(N-1)\frac{\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{N-1}$$

The reason is also straightforward: we know that the sum of the squares of independent standard normal random variables is a $\chi^2$ with degrees of freedom equal to the number of independent variables. Convince yourself that the above formula is true by substituting the formula for the sample variance in the variable $(N-1)\frac{\hat{\sigma}^2}{\sigma^2}$.
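
Both of these distributional claims can be checked by simulation. A sketch assuming numpy and scipy, with arbitrary parameters:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    mu, sigma, N = 0.0, 1.0, 10          # arbitrary parameters for the check

    samples = rng.normal(mu, sigma, size=(50_000, N))

    # Property 2: the sample mean has variance sigma^2 / N
    mu_hats = samples.mean(axis=1)
    print(mu_hats.var(), sigma**2 / N)   # both close to 0.1

    # Property 3: (N-1) * sample variance / sigma^2 behaves like chi-square(N-1)
    q = (N - 1) * samples.var(axis=1, ddof=1) / sigma**2
    print(q.mean(), q.var())             # close to N-1 = 9 and 2(N-1) = 18
    print(stats.kstest(q, stats.chi2(N - 1).cdf).pvalue)  # should be well above 0.05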

4 How reliable are the estimates of the mean?

We will cover the basic statistical inference problem here: the estimation of a confidence interval for the mean. Statistical inference is a tough concept if you've never given it some serious thought. Basically, what we do is to assume that the null hypothesis is true and then try to find evidence in the sample against that hypothesis. That kind of argument, although very straightforward, may seem a bit confusing at first.
Here is the basic idea: you want to test if the population mean is equal to a specific number $\mu_0$ that you have in your head (e.g. zero). Note that this number is totally arbitrary: it represents the null hypothesis you are trying to test. By testing we mean that we will try to find evidence in the sample that goes against our prior idea ($\mu_0$) about the true mean.
In other words, you are asking the question: if our null (hypothesis) is true (that is, if $\mu = \mu_0$), is the estimate that I got from this particular sample a reasonable outcome of the distribution of $\hat{\mu}$ under the null?³ Note that you don't really know the mean and variance of the random variable $\hat{\mu}$ (remember, $\hat{\mu} \sim N\left(\mu, \frac{\sigma^2}{N}\right)$, where $\mu$ and $\sigma^2$ are not known). That is why you have to assume ex-ante something about the mean and then check if that assumption is reasonable given your sample.
OK, so let's suppose that our null hypothesis about the true mean of the estimator $\hat{\mu}$ is $\mu_0$.⁴ If the null is true, then the variable $\frac{\hat{\mu} - \mu_0}{\sigma/\sqrt{N}}$ is standard-normally distributed, that is,

$$\frac{\hat{\mu} - \mu_0}{\sigma/\sqrt{N}} \sim N(0, 1)$$

I would like to stress the fact that the above result is only true if our null hypothesis is true. If our null hypothesis is not true, then $\mu_0$ is not the true mean of $\hat{\mu}$ and the variable $\frac{\hat{\mu} - \mu_0}{\sigma/\sqrt{N}}$ will NOT follow a standard normal distribution. Since $\mu \neq \mu_0$, we are not standardizing the variable $\frac{\hat{\mu} - \mu_0}{\sigma/\sqrt{N}}$ correctly, because we got the mean of $\hat{\mu}$ wrong! Actually, the variable $\frac{\hat{\mu} - \mu_0}{\sigma/\sqrt{N}}$ will follow a normal distribution with mean $m \neq 0$ and variance 1.

³If this is the first time you are seeing this material, you should be very confused by now. Here is what you should do: keep reading, and when you finish this section, read it all again! Everything will start to make sense.

⁴Obviously, since the estimator is unbiased, hypotheses about the true mean of $\hat{\mu}$ are the same as hypotheses about the true mean of X. Keep this point in mind.
But wait, we don't know the value of $\sigma^2$ and hence we cannot compute the variable above! However, we do have an estimator for $\sigma^2$, namely the sample variance. The problem is that the sample variance is also a random variable, and thus the variable $\frac{\hat{\mu} - \mu_0}{\hat{\sigma}/\sqrt{N}}$ will not follow a standard normal. So, what is the distribution of $\frac{\hat{\mu} - \mu_0}{\hat{\sigma}/\sqrt{N}}$?

FACT: If z is a standard normal random variable and q is another random variable (independent from z) following a chi-square distribution with df degrees of freedom, then the variable $\frac{z}{\sqrt{q/df}}$ follows a Student-t distribution with df degrees of freedom, i.e.

$$\frac{z}{\sqrt{q/df}} \sim t(df)$$

Using the fact above, we note that, if the null is true:

$$\frac{\hat{\mu} - \mu_0}{\hat{\sigma}/\sqrt{N}} = \frac{(\hat{\mu} - \mu_0)\big/(\sigma/\sqrt{N})}{\sqrt{\dfrac{(N-1)\hat{\sigma}^2/\sigma^2}{N-1}}} \sim t(N - 1)$$

where, under the null, the numerator is $N(0, 1)$ and $(N-1)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{N-1}$ plays the role of q in the fact above.

Thus, under the null, we know that the variable $\frac{\hat{\mu} - \mu_0}{\hat{\sigma}/\sqrt{N}}$ will follow a t-distribution with $N - 1$ degrees of freedom.

OK, but how do I go about testing the null hypothesis?


Well, it is actually simple: since you know the distribution of $\frac{\hat{\mu} - \mu_0}{\hat{\sigma}/\sqrt{N}}$ under the null, you can calculate the region of the distribution in which the estimate is likely to fall (again, if your null is right).⁵ One usually defines a 95% confidence interval, but this number is totally arbitrary. The example below will guide you through the whole process.

4.1 Testing a hypothesis about the mean

Suppose that I tell you that I believe that the mean return ($\mu$) on a stock is 3%. Assume that you have a sample consisting of 100 outcomes randomly drawn from the true distribution of the stock: a normal with variance 1. How do you test the hypothesis that the true mean is 3%?

First, assume that the null hypothesis is true. In this case, you know that the random variable $z \equiv \frac{\hat{\mu} - 0.03}{1/\sqrt{100}}$ will follow a standard normal distribution. So, before you actually use the sample to calculate the sample mean, you expect that z will lie between -1.96 and 1.96 with a 95% probability (see graph below).

⁵A fine point: the probability that the estimate will fall inside a predetermined interval only makes sense before the sample is analyzed. After that, the probability is either 1 (it does) or 0 (it does not).

[Figure: density of a standard normal distribution with critical values -1.96 and 1.96 marked; the probability between the limits is 95%, with 2.5% in each tail. This picture shows critical values for a standard normal distribution.]

Thus, assuming that the null is true, if the sample gives you a value for z of 0.7 (say), you would not have much evidence against the null hypothesis.

But what if you found that z = 4?

There are two possibilities. The first is that the null is right and that you were very unlucky and got a value for z that is a very unlikely outcome of a standard normal distribution.⁶ The second is that your null is wrong and that the variable $z \equiv \frac{\hat{\mu} - 0.03}{1/\sqrt{100}}$ follows a normal distribution with variance 1 but with mean different from 0.

It is not unreasonable to think that, in the case of z = 4, you have found strong evidence against the null hypothesis. How strong? OK, here is a more rigorous version of your findings: at the 5% significance level, you can reject the null hypothesis.

⁶Of course, the probability of getting a 4 from a normal distribution is zero, as is the probability of getting any real number z. But here we are willing to sacrifice mathematical rigor to get a better intuition of what is going on.
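
To make the mechanics concrete, here is a sketch of the full z-test (assuming numpy/scipy; the sample is generated inside the script, so the null happens to be true here):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    x = rng.normal(0.03, 1.0, size=100)   # hypothetical sample where the null is true

    z = (x.mean() - 0.03) / (1.0 / np.sqrt(100))   # sigma = 1 is known here

    # Two-sided test at the 5% significance level: reject if |z| > 1.96
    print(z, abs(z) > norm.isf(0.025))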


If you don't like this approach, you have an option that will yield equivalent results: the confidence interval approach. The idea is to create an interval (in the support of the distribution of $\hat{\mu}$) that you expect (before the sample is taken) your estimate to fall inside with some (large) probability, say, 95%. How can you construct such an interval?

Well, if under the null $\frac{\hat{\mu} - \mu_0}{\sigma/\sqrt{N}} \sim N(0, 1)$, then it must be true that

$$\hat{\mu} \sim N\left(\mu_0, \frac{\sigma^2}{N}\right)$$

Thus, if under the null $\frac{\hat{\mu} - \mu_0}{\sigma/\sqrt{N}}$ will fall between -1.96 and 1.96 with a 95% probability, it must be true that $\hat{\mu}$ will fall between $\mu_0 - 1.96\,\sigma/\sqrt{N}$ and $\mu_0 + 1.96\,\sigma/\sqrt{N}$ with a 95% probability. Note that all we are doing is mapping the interval $[-1.96, 1.96]$ in the standard normal distribution onto the interval $\left[\mu_0 - 1.96\,\sigma/\sqrt{N},\ \mu_0 + 1.96\,\sigma/\sqrt{N}\right]$ in the $N\left(\mu_0, \frac{\sigma^2}{N}\right)$ distribution.

In the example above, our 95% confidence interval would be $\left[3\% - 1.96/\sqrt{100},\ 3\% + 1.96/\sqrt{100}\right]$. To test the null hypothesis, all that you have to do is to check whether your estimated mean falls into this interval or not. If it does not, you can reject the null at the confidence level you chose to create the confidence interval.
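
The equivalent confidence-interval check, as a sketch under the same assumptions (known $\sigma = 1$, $N = 100$, null mean of 3%; the estimate $\hat{\mu} = 0.05$ is hypothetical):

    import numpy as np

    mu0, sigma, N = 0.03, 1.0, 100
    half_width = 1.96 * sigma / np.sqrt(N)
    lo, hi = mu0 - half_width, mu0 + half_width   # about [-0.166, 0.226]

    mu_hat = 0.05                     # a hypothetical estimate from some sample
    print(not (lo <= mu_hat <= hi))   # True would mean: reject the null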

4.2 But we don't know $\sigma$!

Note that in the example above, we assumed that we know the variance of the true distribution of the random variable we are interested in. However, most of the time we will have to estimate that value using the sample variance. How would we test the hypothesis that $\mu = \mu_0$ in that case?

Simple. We saw above that the variable

$$t = \frac{\hat{\mu} - \mu_0}{\hat{\sigma}/\sqrt{N}}$$

follows a t-distribution with $N - 1$ degrees of freedom. Thus, all we have to do is to look at the t-distribution table to find the critical values for that degree of freedom. Everything else is exactly the same.
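
As a sketch of the same test in code (assuming scipy; the sample is again hypothetical), note that scipy.stats.ttest_1samp performs the identical computation in one call:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.normal(0.03, 1.0, size=100)   # hypothetical sample again

    N = len(x)
    t_stat = (x.mean() - 0.03) / (x.std(ddof=1) / np.sqrt(N))
    t_crit = stats.t(N - 1).isf(0.025)    # about 1.98 for 99 degrees of freedom
    print(t_stat, abs(t_stat) > t_crit)   # manual two-sided test at the 5% level

    # scipy bundles the same computation into one call
    print(stats.ttest_1samp(x, popmean=0.03))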
