
MaMaEuSch
Management Mathematics for European Schools
http://www.mathematik.uni-kl.de/~mamaeusch

Statistical Inference

Paula Lagares Barreiro
Justo Puerto Albandoz

MaMaEuSch
Management Mathematics for European Schools
94342 - CP - 1 - 2001 - 1 - DE - COMENIUS - C21

University of Seville

This project has been carried out with the partial support of the European Community in the frame-
work of the Sokrates programme. The content does not necessarily reflect the position of the European
Community, nor does it involve any responsibility on the part of the European Community.
Contents

1 Statistical Inference
  1.1 Introduction to Statistical Inference
  1.2 Sample distribution of a statistic or an estimator
  1.3 Point estimation. Sample distribution of the main estimators
  1.4 Interval estimation
      1.4.1 Estimation errors and sample size
  1.5 Hypothesis tests
      1.5.1 Relationship between confidence intervals and hypothesis tests
      1.5.2 Chi-square test for adjustment to a distribution
      1.5.3 Dependence and independence tests
      1.5.4 Homogeneity test for several samples
  1.6 Bayesian inference

2 An example of application of inference
  2.1 Case of one population
  2.2 Case of two populations
Chapter 1

Statistical Inference

We devote this chapter to seeing how we can draw conclusions about a population from the data obtained by sampling. We give an overview of the concepts we need to know, as well as of the techniques we can use.

1.1 Introduction to Statistical Inference


Suppose that we have a sample of 60 students from a population of 544 students of a high school. Once we have the sample data, we can ask ourselves questions about the whole population. For instance, do you think that we can say that the average height of the students of the high school is greater than 1.70 m? Do you think that 7 euros is an appropriate value to represent the average pocket money of the students of the high school? Can we say that the height data of the students we have asked are "normal"? Statistical Inference will give us an answer to these questions.
Statistical Inference offers us a wide variety of methods which give answers to very different questions, depending on the purpose of the study.
• Which value is a good estimation of the average height of the students of the high school? Questions like this are answered through parametric methods, in which we assume a form for the distribution of the population and study its parameters.
• What is the distribution of the data about the pocket money of the students of the high school? The answer to this kind of question, referring to the distribution of the population itself, is found through non-parametric methods.
There are also different methods depending on the information that we have and that we use:
• Let us suppose that the average height of the students of the high school is a fixed value that we want to know through the information we get from the sample. Then we say we apply Classical Inference.
• We can also suppose that the average height of the students of the high school is a random variable and that we know some information about this variable a priori. In this case we use Bayesian Inference.

In case we decide to apply techniques of classical inference, our conclusions can be obtained in different ways:

• We can look for a single value for the average height of the students of the high school (found through an estimator or statistic) that we will take as the value of the parameter. In this case we are using point estimation.
• We can also look for a random interval inside which we can find, with some "certainty", the real value of the parameter, for instance, the real value of the average pocket money of the students of the high school. In this case we are talking about interval estimation.
• Let us imagine that we have a possible value for the average height of the students of the high school and we want to test whether this value is "suitable", with some "certainty". In this case we would apply a hypothesis test.

 
Summing up, the methods of Statistical Inference can be classified as follows:
• By purpose: parametric methods and non-parametric methods.
• By type of information: classical inference (point estimation, interval estimation and hypothesis tests) and Bayesian inference.

1.2 Sample distribution of a statistic or an estimator


We have, as we have already mentioned, a sample of size 60 from our population of the students of the high school. We are studying the height and the pocket money of those students. If we want to know the average pocket money of the students, we can use the data of the 60 students that we have to get an approximation to what we are looking for. What could we do to "predict" the value of the average pocket money of the population? It is logical to think that if we calculate the average value of the elements of the sample, we will be close to the population value. The sample average will be our estimator.
So now, let us imagine that one of your colleagues has a sample of size 60 from the population, which obviously will be different from the one we have. If he calculates the sample average in this case, will he get the same value we got? The answer is that, in general, he will get a different value. We are interested then in knowing how the sample average varies when we change the sample data. Our estimator is going to behave as a random variable.
We will call any function of the sample values a statistic. This function assigns to each possible sample a numeric value, so what we really have is a random variable with a probability distribution. We will call this probability distribution the sample distribution of the statistic, and it will depend on the unknown parameters of the distribution which are the purpose of the study.
We will call an estimator of a population parameter any statistic whose value, for most of the samples, is close to the unknown population parameter.

Example 1.2.1 Let us imagine now that we have 3 papers in a bag. We are going to make a draw in which we will have two possible winners; that is, one person will take out one of the papers, then we put it back inside the bag and another person takes out a paper. The papers have values of 0 euros, 500 euros and 1000 euros. How will the average of these samples of size 2 behave in our population of 3 papers? Which is the most probable result?
The possibilities we have are the following:

(0, 0), (0, 500), (0, 1000), (500, 0), (500, 500), (500, 1000), (1000, 0), (1000, 500), (1000, 1000).

We calculate the average for each of them and the probability with which it appears:

x̄ (average)    0     250   500   750   1000
Probability    1/9   2/9   3/9   2/9   1/9

The average of the population (0, 500, 1000) is 500 and the variance is 166666.6̄, while for the random variable sample average, the average is 500 and the variance is 83333.3̄. As we can see, they have the same average but the variance of the sample average is lower. If we plot the distribution of the sample average, we see that it looks like a normal distribution.

The Central Limit Theorem confirms the fact we have noticed in the example above. Given a random variable with average µ and variance σ², the distribution of the sample averages, as n (the sample size) increases to infinity, tends asymptotically to a distribution N(µ, σ/√n).
In the example above we have listed all the possible samples, but imagine what it would be to do the same with all the possible samples of 60 students out of the 544 students of the high school. It would be a never-ending calculation. That is why the Monte Carlo method is usually used. It consists in simulating, through tables of random numbers or with a computer, the drawing of a great number of samples; we calculate for each of them the value of the statistic, and with that we get an approximate probability distribution (the bigger the number of samples generated, the better the approximation).
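To illustrate, the following Python sketch applies the Monte Carlo method to the draw of Example 1.2.1: it simulates a large number of samples of size 2 and approximates the distribution of the sample average (the seed and the number of samples are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
papers = np.array([0, 500, 1000])   # values of the three papers in the bag

# Simulate many samples of size 2, drawn with replacement, and keep their averages.
n_samples = 100_000
means = rng.choice(papers, size=(n_samples, 2)).mean(axis=1)

# The empirical distribution of the sample average approximates the exact one.
values, counts = np.unique(means, return_counts=True)
print(dict(zip(values, counts / n_samples)))   # close to 1/9, 2/9, 3/9, 2/9, 1/9
print(means.mean(), means.var())               # close to 500 and 83333.3
```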

1.3 Point estimation. Sample distribution of the main estimators
We come now to the point in which we need to estimate the parameters of the population. We want to know the average and variance of the height and pocket money of the students of the high school. We can take as an estimation the values of the sample average and the sample quasivariance for our sample of 60 students out of the 544. In this case we make a point estimation, because we estimate the value of the unknown parameter through a single value of the estimator.
But now, does the average of the distribution of our estimator coincide with the value of the population parameter? For instance, the average of the sample variance does not coincide with the population variance, so it will not be a good estimator of the variance. Does the value of the estimation get closer to the parameter if we increase the sample size? These and other properties are desirable in an estimator.
When we want to estimate the value of a population parameter we want the estimator to have a certain number of properties in order to get a "good" estimation:
• Centered or unbiased: the average of the sample distribution of the statistic coincides with the unknown population parameter.
• Consistent: if we increase the size of the sample, the value of the sample statistic converges to the estimated parameter.
• Efficient: it has the minimum variance among all the unbiased estimators.
• Sufficient: it uses all the information about the parameter provided by the sample data.
Let $x_1, x_2, \ldots, x_n$ be a random sample of a population. The most common estimators are:
For the population average, the sample average:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$

For the population proportion, the sample proportion:

$$p = \frac{\text{observed values of } A}{\text{sample size}}.$$

For the population variance, the sample quasivariance:

$$S_c^2 = \frac{n}{n-1}\,S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1};$$

as we have already mentioned, we will not use the sample variance because it is a biased estimator of the population variance.
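As a sketch of how these estimators are computed in practice, the following Python lines evaluate them on a small hypothetical sample of heights:

```python
import numpy as np

heights = np.array([1.72, 1.65, 1.80, 1.58, 1.75, 1.69])  # hypothetical sample

x_bar = heights.mean()            # sample average: estimator of the population average
s2_c = heights.var(ddof=1)        # sample quasivariance: divides by n - 1, so unbiased
p_hat = (heights > 1.70).mean()   # sample proportion of students taller than 1.70 m

print(x_bar, s2_c, p_hat)
```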
It can be proved that if we have a random variable X from a population with average µ and standard deviation σ, then:

• In sampling with replacement, or in an infinite population, x̄ has µ as average value and σ/√n as standard deviation.

• In sampling without replacement from a finite population, x̄ has µ as average value and (σ/√n)·√((N − n)/(N − 1)) as standard deviation.

We can notice that the only difference between the case of infinite population or sampling with replacement and the case of sampling without replacement from a finite population is that the standard deviation is multiplied by the correction factor √((N − n)/(N − 1)), where N is the size of the population and n is the size of the sample.
Moreover, if X is a random variable with a normal distribution having µ as average and σ as a known standard deviation, we have that x̄ follows a distribution N(µ, σ/√n).

As we can see in the expression for the sample average, the bigger n is, the smaller the standard deviation, and so the smaller the error committed when we take the sample average as an estimator of the population average. We will have to check how convenient it is to increase the sample size, fitting the economic budget we have.
It can be proved that if X is a random variable with a normal distribution with µ as average and σ as standard deviation, we have that

$$\frac{\bar{x} - \mu}{S_c/\sqrt{n}}$$

has a t-Student distribution with n − 1 degrees of freedom.

If the sample size is greater than 30, the t-Student distribution can be approximated by a N(0, 1) distribution.
It can be proved that if X has a normal distribution with µ as average and σ as standard deviation, we have that

$$\frac{(n-1)S_c^2}{\sigma^2} = \frac{nS^2}{\sigma^2}$$

has a chi-square distribution χ²(n − 1) with n − 1 degrees of freedom.

1.4 Interval estimation


We have already said that we can make a point estimation for the average height of the students of the high school by calculating the sample average of the 60 students who have been chosen. But we do not know whether this value is exact or carries a certain error. We say that the value of the population average is "near" the sample average but, what does "near" mean? Can I get two values and say that the real value is "almost surely" between them? If I want to have 90% or 95% certainty that the average is inside some region, can I build that region?
In this section we will see what "interval estimation" means. It consists in providing an interval in which we can find the population parameter with a certain confidence.
We suppose from now on that our populations are normal unless we say the contrary.
The notation we will use in this section is the following:

µ = population average, σ = population standard deviation,
N = size of the population, n = size of the sample,
x̄ = sample average, p = sample proportion (q = 1 − p),
$S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$ the sample variance, $S_c^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$ the sample quasivariance.

Moreover,

• zα is the value of a variable N(0, 1) which leaves an area (probability) α on its right side.
• tα(n − 1) is the value of a t-Student variable with n − 1 degrees of freedom that leaves an area (probability) α on its right side.
• χ²α(n − 1) is the value of a chi-square variable with n − 1 degrees of freedom that leaves an area (probability) α on its right side.
• Fα(m, n) is the value of a Snedecor F variable with (m, n) degrees of freedom that leaves an area (probability) α on its right side.

The most used values of zα are:

α    0.0005   0.005   0.01     0.02      0.025   0.05    0.1
zα   3.29     2.575   2.3263   2.05374   1.96    1.645   1.2815

So, from now on we search for an interval (a, b) such that the unknown population parameter can be found inside the interval with a certain precision or confidence level. To find that interval we use the data provided by the sample, so this means we will find different intervals for different samples.
The concept of confidence level (for instance 90%) refers to the fact that if we consider a big number of samples, and for each of these samples we calculate the confidence interval for a certain unknown parameter h, this parameter is inside approximately 90% of these intervals.
This fact is very important: when we build our confidence interval for a certain sample, we should not make the mistake of thinking that "the population parameter is inside the interval with a probability of 0.90", because this is a wrong interpretation. The interval is random before calculating the value of the statistic for each sample; once it is calculated for a concrete sample, it is not random anymore and it either contains the parameter or it does not.
To build the confidence interval of a population parameter θ, we start by considering an estimator $\hat{\theta}$ (generally unbiased), and from it we construct an interval of half-width λb, so that the interval will be of the form $(\hat{\theta} - \lambda b, \hat{\theta} + \lambda b)$, with the condition that the probability that the unknown parameter θ is inside the interval is 1 − α, that is,

$$P[\hat{\theta} - \lambda b \le \theta \le \hat{\theta} + \lambda b] = 1 - \alpha.$$

The term λb is the margin of error or precision of the estimation of the unknown population parameter; it is usually called the typical error of estimation or standard error.
We now give in a detailed way the confidence intervals at confidence level 1 − α, according to the different situations and population parameters for which we want to calculate those intervals.
For the case of one population and sampling with replacement, we have:

Population            Parameter   Confidence interval

N(µ, σ), σ known      µ           $\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$

N(µ, σ), σ unknown    µ           $\left(\bar{x} - t_{\alpha/2}(n-1)\frac{S_c}{\sqrt{n}},\ \bar{x} + t_{\alpha/2}(n-1)\frac{S_c}{\sqrt{n}}\right)$

N(µ, σ), µ known      σ²          $\left(\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{\chi^2_{\alpha/2}(n)},\ \frac{\sum_{i=1}^{n}(x_i-\mu)^2}{\chi^2_{1-\alpha/2}(n)}\right)$

N(µ, σ), µ unknown    σ²          $\left(\frac{(n-1)S_c^2}{\chi^2_{\alpha/2}(n-1)},\ \frac{(n-1)S_c^2}{\chi^2_{1-\alpha/2}(n-1)}\right)$

B(n, p), n > 30       p           $\left(p - z_{\alpha/2}\sqrt{\frac{p\,q}{n}},\ p + z_{\alpha/2}\sqrt{\frac{p\,q}{n}}\right)$

In case we make sampling without replacement, or the population is finite, in general we should multiply the standard error by the factor √((N − n)/(N − 1)).
As we can see, the structure of the confidence interval is $(\hat{\theta} - \lambda b, \hat{\theta} + \lambda b)$, where $\hat{\theta}$ is an estimator of the population parameter we want to calculate the interval for, λ is a value (critical point) of a well-known distribution and b depends on the size of the sample n. For instance, if we have a normal distribution with known standard deviation, a confidence interval for the average is $\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$, where λ is in this case $z_{\alpha/2}$, a critical point of N(0, 1) that leaves on its right side an area of α/2, b would be σ/√n, and the estimator $\hat{\theta}$ of the population average would be x̄.
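As a sketch of the computation, the following Python code builds the interval from the table for a normal population with unknown σ (the data values are hypothetical):

```python
import numpy as np
from scipy import stats

data = np.array([7.0, 5.5, 9.0, 6.5, 8.0, 7.5])  # hypothetical pocket-money sample
n = len(data)
alpha = 0.10                                     # confidence level 1 - alpha = 90%

x_bar = data.mean()
s_c = data.std(ddof=1)                           # quasi-standard deviation S_c
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)    # critical point t_{alpha/2}(n - 1)

half_width = t_crit * s_c / np.sqrt(n)           # this is lambda * b
print((x_bar - half_width, x_bar + half_width))
```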
For the case of two independent populations (and random sampling with replacement), we have:

Population                   Parameter   Confidence interval

N(µ, σ), σ known             µx − µy     $(\bar{x} - \bar{y}) \pm z_{\alpha/2}\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}$

N(µ, σ), σ unknown, equal    µx − µy     $(\bar{x} - \bar{y}) \pm t_{\alpha/2}(n_x+n_y-2)\sqrt{\frac{(n_x-1)S_{cx}^2 + (n_y-1)S_{cy}^2}{n_x+n_y-2}}\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}$

B(n, p), n > 30              px − py     $(p_x - p_y) \pm z_{\alpha/2}\sqrt{\frac{p_x q_x}{n_x} + \frac{p_y q_y}{n_y}}$

N(µ, σ), µ unknown           σx²/σy²     $\left(\frac{S_{cx}^2}{S_{cy}^2}\,\frac{1}{F_{\alpha/2}(n_x-1,\,n_y-1)},\ \frac{S_{cx}^2}{S_{cy}^2}\,\frac{1}{F_{1-\alpha/2}(n_x-1,\,n_y-1)}\right)$

For samples which are big enough we can consider as valid the intervals built applying the
previous expressions.
In case we are not in any of the situations above, we can still build a confidence interval for the population average of any population applying Tchebycheff's theorem:
Let X be a random variable with average µ and standard deviation σ. Then, for any value of k > 0, P(|X − µ| ≥ kσ) ≤ 1/k².
If x̄ is the sample average, we would have an interval for the population average with known standard deviation,

$$\left(\bar{x} - k\frac{\sigma}{\sqrt{n}},\ \bar{x} + k\frac{\sigma}{\sqrt{n}}\right),$$

which contains the population average with probability at least 1 − 1/k².
Therefore, the steps we have to follow to build a confidence interval could be the following:

1. Establishing the population and the distribution law of that population.
2. Fixing two of the following data: confidence level, sample size or estimation error.
3. Considering the appropriate estimator for the population parameter for which we want to calculate the confidence interval, and calculating the value of that estimator.
4. Considering the critical point of the estimator distribution and applying the appropriate expression for the confidence interval.

1.4.1 Estimation errors and sample size


We have been supposing until now that the sample size is known. Nevertheless, we have to determine the sample size taking into account that the bigger the sample size is, the lower the estimation error, because we would be closer to studying the whole population. But usually the economic cost of sampling, the time we can use and some other factors will not allow us to increase the sample size as much as we would like to. On the other hand, a very small sample size could lead us not to reach the expected confidence level.
We can consider that when we estimate the value of a population parameter through a confidence interval, we get a typical estimation error equal (in absolute value) to λb, which generally depends on n. This error depends on the deviation of the estimator and on the critical point determined by the distribution of the estimator.
So we can:

1. Fix the confidence level and the error we are willing to accept, calculating the appropriate sample size needed (see the sketch below).
2. Calculate the error we make with a given sample size and confidence level.
3. Calculate the confidence level we can have with a given sample size and error.
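For instance, the first option can be sketched in Python as follows, assuming a normal population with known σ (all the numbers are hypothetical):

```python
import math
from scipy import stats

sigma = 8.0    # assumed known population standard deviation
error = 1.5    # desired margin of error (lambda * b)
alpha = 0.05   # confidence level 1 - alpha = 95%

z = stats.norm.ppf(1 - alpha / 2)          # z_{alpha/2}
n = math.ceil((z * sigma / error) ** 2)    # smallest n with z * sigma / sqrt(n) <= error
print(n)
```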

1.5 Hypothesis tests


We all have intuitions about what could happen inside a certain population. For instance, someone could think that the average pocket money of the students of the high school is greater than 5 euros. Or that, in general, the average height is lower than 1.80 m. But we can also ask ourselves concrete questions: do you think that we can consider that 10% of the students of the high school are left-handed? Or, on the contrary, do you think that it is more correct to say that there are fewer than 10% of left-handed students? This kind of question can be answered through a parametric hypothesis test.

If we really want to check whether these statements are right or wrong, we should measure each and every element of the population. But, as usual, this will not in general be a real possibility, so we have to try to answer all those questions using the data we have from our sample.
A hypothesis test allows us to accept or reject statements about the population depending on the data we get through a sample.
Obviously this means that the conclusion we reach may not be true, therefore we should try to assure a certain degree of precision for the case in which we accept the hypothesis that is posed. This precision degree is what we call the confidence level.
We can have two main types of hypothesis tests:

• The ones which pose hypotheses about the parameters of the probability distribution of the population. For instance, that the average of a normal population is equal to 7. We will call them parametric tests.
• The ones posing other types of hypotheses. For instance, that a certain population has a normal distribution, or that there is no dependence between the variables height and pocket money of the students of a high school. We will call them non-parametric tests.

Once we have applied a hypothesis test and we accept the initial hypothesis, this does not mean that we have proved (in the mathematical sense) the statement, because we have not checked all the elements of the population, and the hypothesis could even have been rejected with the data of another sample. What we can say is that we cannot reject the statement with the data that we have.
From now on we are going to present parametric tests (on the average, variance and proportion) as well as non-parametric ones (homogeneity or heterogeneity of the population and independence in contingency tables).
We first need some concepts:

• Null hypothesis: we denote it by H0 and it represents the statement on the population parameters that we pose. For instance, H0: µ = 5 in a normal population; that is, we would like to test whether the average of a normal population is equal to 5.
• Alternative hypothesis: it represents the opposite statement to H0. It is denoted by H1. In the example above it would be H1: µ ≠ 5; that is, that the average is not equal to 5.
• Test statistic: it is a function of the sample data that will allow us to decide whether to accept or reject H0. Its probability distribution has to be known under H0.
• Acceptance region: set of values (interval) of the test statistic which makes us decide to accept H0; it has probability 1 − α if the null hypothesis is true.
• Rejection region: set of values (interval) complementary to the one above, with probability α under the null hypothesis.
• Confidence level: it represents the probability that we want to have of accepting H0 when it is true. It is denoted by 1 − α. It gives us the probability of the acceptance region under the null hypothesis.

• Significance level: it represents the probability of rejecting H0 when it is true, and it is complementary to the confidence level; that is, α. It gives us the probability of the rejection region under the null hypothesis.

We can also make a classification of the parametric tests as follows:

• Bilateral tests: the null hypothesis is posed in such a way that the population parameter is univocally determined. For instance, the average is equal to 5, or the variance is 3.
• One-sided tests: the null hypothesis is posed in such a way that the value of the unknown population parameter lies inside a semi-open interval. To know the distribution of the test statistic, we will suppose that the population parameter takes the value of one of the limits of the interval. For instance, H0: µ ≥ 3 (the average is greater than or equal to 3) against the alternative hypothesis H1: µ < 3. When we have to determine the distribution of the statistic, we will suppose that under the null hypothesis µ = 3.

When we apply the hypothesis test we use the values of a statistic whose probability distribution should be known under the null hypothesis. Then the sample data can lead us to two types of errors:

• Error of type I: this is the error produced when we reject the null hypothesis when it is true. The probability of that decision is the significance level α.
• Error of type II: this is the error produced when we accept the null hypothesis when it is false, which is the same as rejecting H1 when it is true. The probability of rejecting the alternative hypothesis when it is true is denoted by β.
• Power of the test: it represents the probability of rejecting H0 when H1 is true, that is, 1 − β.
We can make a summary of the decisions taken and the errors made in the following table:

Decision / Reality   H0 true                               H1 true
Accepting H0         Right decision (1 − α)                Wrong decision: error of type II (β)
Rejecting H0         Wrong decision: error of type I (α)   Right decision: power (1 − β)

The probabilities of errors of type I and II work against each other, in the sense that decreasing one increases the other and vice versa, so we should try to minimize the error that we consider to be more relevant, accepting that this increases the other one. A possible solution consists in searching for the appropriate sample size which makes the levels of the errors of type I (α) and type II (β) compatible; that is, having fixed one of the errors, we choose the sample size so that the other is inside the desired limits.
The steps to follow to make a hypothesis test (sketched in code further below) are:

1. To establish the distribution of the population, the null hypothesis H0 and the alternative hypothesis H1.
2. To fix the confidence level, 1 − α, and the sample size, n.
3. To select a sample and to calculate the value of the corresponding statistic, whose distribution will be known under H0.

4. To determine the acceptance region and the rejection region.
5. To accept H0 if the value of the statistic is inside the acceptance region; otherwise, H0 is rejected.
6. To draw the statistical conclusions.
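A minimal Python sketch of these steps, for a bilateral test on the average of a normal population with unknown σ and hypothetical data, could be:

```python
import numpy as np
from scipy import stats

data = np.array([6.5, 8.0, 5.0, 9.5, 7.0, 6.0, 8.5, 7.5])  # hypothetical sample
mu0 = 7.0      # H0: mu = mu0 against H1: mu != mu0 (bilateral test)
alpha = 0.05   # confidence level 1 - alpha = 95%

n = len(data)
T = (data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(n))
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)               # t_{alpha/2}(n - 1)

# Accept H0 when |T| falls in the acceptance region; reject otherwise.
print("reject H0" if abs(T) >= t_crit else "accept H0")
```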

In the following table we can see the different statistics which are normally used, as well as the critical regions depending on the type of test applied. For the case of only one population:

Population            H0       H1       Statistic                                            Critical region
N(µ, σ), σ known      µ = µ0   µ ≠ µ0   $T = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$        |T| ≥ z_{α/2}
                      µ ≥ µ0   µ < µ0                                                        T < z_{1−α}
                      µ ≤ µ0   µ > µ0                                                        T > z_α

N(µ, σ), σ unknown    µ = µ0   µ ≠ µ0   $T = \frac{\bar{x} - \mu_0}{S_c/\sqrt{n}}$           |T| ≥ t_{α/2}(n−1)
                      µ ≥ µ0   µ < µ0                                                        T < t_{1−α}(n−1)
                      µ ≤ µ0   µ > µ0                                                        T > t_α(n−1)

N(µ, σ), µ known      σ = σ0   σ ≠ σ0   $T = \frac{\sum_{i=1}^{n}(x_i-\mu)^2}{\sigma_0^2}$   T > χ²_{α/2}(n) or T < χ²_{1−α/2}(n)
                      σ ≥ σ0   σ < σ0                                                        T < χ²_{1−α}(n)
                      σ ≤ σ0   σ > σ0                                                        T > χ²_α(n)

N(µ, σ), µ unknown    σ = σ0   σ ≠ σ0   $T = \frac{(n-1)S_c^2}{\sigma_0^2}$                  T > χ²_{α/2}(n−1) or T < χ²_{1−α/2}(n−1)
                      σ ≥ σ0   σ < σ0                                                        T < χ²_{1−α}(n−1)
                      σ ≤ σ0   σ > σ0                                                        T > χ²_α(n−1)

B(n, p)               p = p0   p ≠ p0   $T = \frac{p - p_0}{\sqrt{p_0(1-p_0)/n}}$            |T| ≥ z_{α/2}
                      p ≥ p0   p < p0                                                        T < z_{1−α}
                      p ≤ p0   p > p0                                                        T > z_α
For the case of two populations, we have:

Populations                  H0            H1            Statistic                                                                                                              Critical region
N(µ, σ), σ known             µx − µy = a   µx − µy ≠ a   $T = \frac{\bar{x} - \bar{y} - a}{\sqrt{\sigma_x^2/n_x + \sigma_y^2/n_y}}$                                             |T| ≥ z_{α/2}
                             µx − µy ≥ a   µx − µy < a                                                                                                                          T < z_{1−α}
                             µx − µy ≤ a   µx − µy > a                                                                                                                          T > z_α

N(µ, σ), σ unknown, equal    µx − µy = a   µx − µy ≠ a   $T = \frac{\bar{x} - \bar{y} - a}{\sqrt{\frac{(n_x-1)S_{cx}^2 + (n_y-1)S_{cy}^2}{n_x+n_y-2}}\sqrt{\frac{1}{n_x}+\frac{1}{n_y}}}$   |T| ≥ t_{α/2}(nx+ny−2)
                             µx − µy ≥ a   µx − µy < a                                                                                                                          T < t_{1−α}(nx+ny−2)
                             µx − µy ≤ a   µx − µy > a                                                                                                                          T > t_α(nx+ny−2)

N(µ, σ), µ known             σx² = σy²     σx² ≠ σy²     $T = \frac{\sum_{i=1}^{n_x}(x_i-\mu_x)^2}{\sum_{i=1}^{n_y}(y_i-\mu_y)^2}$                                              T > (nx/ny)F_{α/2}(nx, ny) or T < (nx/ny)F_{1−α/2}(nx, ny)
                             σx² ≥ σy²     σx² < σy²                                                                                                                            T < (nx/ny)F_{1−α}(nx, ny)
                             σx² ≤ σy²     σx² > σy²                                                                                                                            T > (nx/ny)F_α(nx, ny)

N(µ, σ), µ unknown           σx² = σy²     σx² ≠ σy²     $T = \frac{S_{cx}^2}{S_{cy}^2}$                                                                                        T > F_{α/2}(nx−1, ny−1) or T < F_{1−α/2}(nx−1, ny−1)
                             σx² ≥ σy²     σx² < σy²                                                                                                                            T < F_{1−α}(nx−1, ny−1)
                             σx² ≤ σy²     σx² > σy²                                                                                                                            T > F_α(nx−1, ny−1)
Let us recall that when we need the values of tα(n − 1) with n greater than 30, this distribution is approximated by N(0, 1), so we look for the values of zα instead.

1.5.1 Relationship between confidence intervals and hypothesis tests


When we apply a bilateral hypothesis test and we reject the null hypothesis, we do not really know which value we can assign to the unknown parameter; we just know that there is a value we cannot consider to be true, with a certain confidence level.
Sometimes, instead of applying a hypothesis test we can build the confidence interval for the parameter and then reject all those null hypotheses of the type H0: parameter = k0 whenever the value k0 is not inside the confidence interval we build.
For one-sided hypotheses:

• For H0: parameter ≤ k0 against H1: parameter > k0 with a significance level of α, we build a confidence interval for the unknown population parameter with a confidence level of 1 − 2α. If the value k0 is lower than the lower limit of the interval, we reject hypothesis H0. In general, we reject any null hypothesis of this type whose k0 is lower than the lower limit of the interval.
• For H0: parameter ≥ k0 against H1: parameter < k0 with a significance level of α, we build a confidence interval for the unknown population parameter with a confidence level of 1 − 2α. If the value k0 is greater than the upper limit of the interval, we reject hypothesis H0. In general, we reject any null hypothesis of this type whose k0 is greater than the upper limit of the interval.

We now start presenting non-parametric hypothesis tests. The tests we will study from now on are based on the chi-square distribution. We will see tests for the adjustment of a theoretical distribution to an empirical one, as well as the application to contingency tables.

1.5.2 Chi-square test for adjustment to a distribution


Let us suppose that we have a population and a variable X which takes different values x1, x2, ..., xk, mutually exclusive, with probabilities p1, p2, ..., pk. We have a sample of size n in which we measure the variable X, and we ask ourselves how well it fits a certain theoretical distribution already known.
Independently of the theoretical distribution we consider, there will always be some gap between the observed values and the expected ones. The problem is to decide whether this gap is due to randomness or the data simply do not fit the distribution.
We will use the following notation:
Oi = number of elements of the sample with value xi.
pi = theoretical probability that the random variable takes the value xi, holding that $\sum_{i=1}^{k} p_i = 1$.
If we have a sample of size n, the number of elements that we can expect to take the value xi is ei = n·pi, holding that $\sum_{i=1}^{k} n p_i = n$.
We can build the following table:

Variable X               x1   x2   ...   xk
Observed frequencies     O1   O2   ...   Ok
Expected frequencies     e1   e2   ...   ek
We will have the following null and alternative hypotheses:
H0: the empirical distribution fits the theoretical distribution.
H1: we reject that it fits.
It is obvious that if we accept the null hypothesis (we accept that it fits), the gaps between observed values and expected values are due to randomness, and we can say that we have no evidence to reject the hypothesis; otherwise, we would say that there are significant differences between the two distributions at the fixed significance level, and we cannot attribute these differences to randomness.
The statistic used for the test is:

$$T = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i} = \sum_{i=1}^{k} \frac{O_i^2}{e_i} - n.$$

Pearson proved that the distribution of this statistic is a chi-square with k − 1 degrees of freedom in case there are no discrepancies between the observed values and the expected ones.
We accept H0 if T < χ²α(k − 1) (ACCEPTANCE REGION).
We reject H0 if T ≥ χ²α(k − 1) (REJECTION REGION).
To apply the test in a proper way, we have to make the following considerations (a worked sketch follows the list):
1. The expected frequencies of the different values should be greater than 5; in case they are not, we should group values into classes so that the new frequencies are greater than 5. This means changing the theoretical distribution and losing some information.
2. If we need to estimate p parameters, then the degrees of freedom of the chi-square are k − p if they are independent and k − p − 1 if they are not.
3. It can be applied to discrete and continuous distributions.
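As a sketch, the following Python code applies the test to hypothetical observed frequencies, taking a uniform distribution as the theoretical one:

```python
import numpy as np
from scipy import stats

observed = np.array([18, 22, 25, 15, 20])   # hypothetical observed frequencies O_i
p = np.full(5, 1 / 5)                       # H0: the five values are equally likely
expected = p * observed.sum()               # e_i = n p_i (all greater than 5 here)

T = ((observed - expected) ** 2 / expected).sum()
crit = stats.chi2.ppf(1 - 0.05, df=len(observed) - 1)   # chi^2_alpha(k - 1), alpha = 0.05
print(T, crit, "reject H0" if T >= crit else "accept H0")
```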

1.5.3 Dependence and independence tests
We want to know whether two variables X and Y from the same population are dependent or independent. We suppose that the possible values of the variables are:

X: x1, x2, ..., xk,
Y: y1, y2, ..., ym,

and we have a sample of size n in which we measure both variables X and Y.
We denote:
Oij = number of elements presenting values xi and yj.
eij = number of elements expected to present values xi and yj if the variables are independent.
We can build the following contingency table, where empirical and theoretical frequencies appear:

X\Y                         y1          ...   yj          ...   ym          Absolute frequencies of X
x1                          O11 | e11   ...   O1j | e1j   ...   O1m | e1m   Ox1
...                         ...         ...   ...         ...   ...         ...
xi                          Oi1 | ei1   ...   Oij | eij   ...   Oim | eim   Oxi
...                         ...         ...   ...         ...   ...         ...
xk                          Ok1 | ek1   ...   Okj | ekj   ...   Okm | ekm   Oxk
Absolute frequencies of Y   Oy1         ...   Oyj         ...   Oym         n

To calculate the theoretical frequencies we can use the following expression, valid when the variables are independent:

$$e_{ij} = p_{ij}\,n = \frac{O_{x_i}}{n}\cdot\frac{O_{y_j}}{n}\cdot n = \frac{O_{x_i} O_{y_j}}{n} = \frac{(\text{total of row } i)\cdot(\text{total of column } j)}{n},$$

for i = 1, 2, ..., k and j = 1, 2, ..., m.
We will pose the null and alternative hypotheses as follows:
H0: X and Y are independent.
H1: X and Y are not independent.
If we accept the null hypothesis, we can consider that we have no evidence that could make us suppose that there is a certain dependence between the two variables, with a confidence level of 1 − α.
We consider the statistic for the test:

$$T = \sum_{i=1}^{k}\sum_{j=1}^{m} \frac{(O_{ij} - e_{ij})^2}{e_{ij}} = \sum_{i=1}^{k}\sum_{j=1}^{m} \frac{O_{ij}^2}{e_{ij}} - n.$$

This statistic has a chi-square distribution with (k − 1)(m − 1) degrees of freedom in case the variables are independent.
We accept H0 if T < χ²α((k − 1)(m − 1)) (ACCEPTANCE REGION).
We reject H0 if T ≥ χ²α((k − 1)(m − 1)) (REJECTION REGION).
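A sketch of the test in Python, with a hypothetical contingency table; scipy's chi2_contingency computes the expected frequencies eij and the statistic T directly:

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows are values of X, columns are values of Y.
O = np.array([[12,  8,  5],
              [10, 15, 10]])

T, p_value, dof, expected = stats.chi2_contingency(O, correction=False)
crit = stats.chi2.ppf(1 - 0.05, df=dof)     # dof = (k - 1)(m - 1)
print(T, crit, "reject H0" if T >= crit else "accept H0")
```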

1.5.4 Homogeneity test for several samples
We try to decide whether several samples measuring the same characteristic A belong or not to the same population with respect to that characteristic A.
Let us suppose that we have k samples of sizes n1, n2, ..., nk, where y1, y2, ..., yk are the numbers of elements of each sample presenting a certain characteristic A, and the rest do not present it.
If we suppose that all the samples come from the same population, the proportion of elements presenting characteristic A is:

$$p = \frac{y_1 + y_2 + \cdots + y_k}{n_1 + n_2 + \cdots + n_k}.$$

Under this supposition, the expected numbers of elements with characteristic A in each sample are n1p, n2p, ..., nkp.
We can build the following table, in which we present the observed and expected values:

Sample        Present characteristic A   Do not present characteristic A   Size of the sample
              (observed | expected)      (observed | expected)
First         y1 | n1 p                  n1 − y1 | n1(1 − p)               n1
...           ...                        ...                               ...
i-th          yi | ni p                  ni − yi | ni(1 − p)               ni
...           ...                        ...                               ...
k-th          yk | nk p                  nk − yk | nk(1 − p)               nk

We consider the following null and alternative hypotheses:
H0: all the samples come from the same population.
H1: we reject that they come from the same population.
If we accept the null hypothesis, we can consider that the samples come from the same population and that the gaps between the observed values and the expected ones are due to randomness.
The statistic to be used is:

$$T = \frac{1}{p(1-p)} \sum_{i=1}^{k} \frac{(y_i - n_i p)^2}{n_i}.$$

The distribution of this statistic is a chi-square with k − 1 degrees of freedom in case there are no discrepancies between the observed and the expected values.
We accept H0 if T < χ²α(k − 1) (ACCEPTANCE REGION).
We reject H0 if T ≥ χ²α(k − 1) (REJECTION REGION).
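A sketch of the computation in Python, with hypothetical samples:

```python
import numpy as np
from scipy import stats

y = np.array([12, 20, 9])    # hypothetical counts presenting characteristic A
n = np.array([50, 80, 40])   # sizes of the k samples

p = y.sum() / n.sum()        # pooled proportion under H0
T = ((y - n * p) ** 2 / n).sum() / (p * (1 - p))

crit = stats.chi2.ppf(1 - 0.05, df=len(y) - 1)   # chi^2_alpha(k - 1)
print(T, crit, "reject H0" if T >= crit else "accept H0")
```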
In case the elements of the population are classified into more than 2 categories, the analysis is made as in the case of an independence test between variables: the table to be used would be similar to the one above, with the samples in rows and the different categories in columns. The statistic would be the same as in the case of independence of variables, the expected values would be calculated in the same way, and the null hypothesis is H0: all the samples are homogeneously distributed.

When we want to analyze a population, we have to take into account that the population may be divided into heterogeneous subpopulations; if we do not take this into account, we can get mistaken results.
Let us consider, for instance, the following data about the students of a high school admitted to some seminars:

          N of applications   N of admitted   Admitted proportion
Men       1000                470             0.47
Women     1000                570             0.57

If we suppose that the population is homogeneous, we will come to the conclusion that there is a significant difference between men and women, in favor of women, in the admissions to the seminars.
But if the data are analyzed by seminar (A, B or C), we get the following table:

                      N of applications   N of admitted   Admitted proportion (%)
Seminar A   Men       150                 112             74.67
            Women     400                 280             70
Seminar B   Men       350                 70              20
            Women     50                  8               16
Seminar C   Men       500                 288             57.6
            Women     550                 282             51.27

As we can see, the admission proportion is in favor of men in every seminar. Therefore, the conclusions are different if we group the data. This situation is known as Simpson's paradox.

1.6 Bayesian inference


At the beginning of the chapter we saw that we can approach inference in another way, one that considers some a priori probabilities from which we calculate a posteriori probabilities. We are going to make a summary of the main techniques of these methods.
The Bayesian inference method is based on Bayes' theorem, by which, from some a priori probabilities, we calculate a posteriori probabilities; it supposes that the population parameter is not an unknown constant but a random variable with a known distribution law.
The estimation procedure starts from a priori knowledge and data derived from previous observations, so that when we take a new sample we estimate the parameters again, updating the previous values with the new ones.
Classical and Bayesian methods are quite similar if the sample size is big enough, or if the a priori information is almost null; indeed, they can reach exactly the same conclusion. Nevertheless, for small sample sizes they can reach completely different conclusions.
In general, Bayesian methods are more complicated than classical ones, though more satisfying in many situations. They provide smaller intervals, more reliable point estimations and more appropriate hypothesis tests.
For instance, suppose that the a priori distribution of the average is N(µ0, σ0) and that the population has known standard deviation σx. If we take a sample of size n and calculate the sample average x̄, the parameters are updated through the following expression:

$$N(\mu_1, \sigma_1) = N\left(\frac{\frac{1}{\sigma_0^2}\,\mu_0 + \frac{n}{\sigma_x^2}\,\bar{x}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma_x^2}},\ \sqrt{\frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma_x^2}}}\right).$$

In general, the a posteriori average is a weighted combination of the a priori average and the sample average: µ1 = Kµ0 + (1 − K)x̄.
In the same way, we can apply Bayesian inference to calculate confidence intervals and parametric hypothesis tests, adding the information obtained through the sample to the final expression.
For instance, a confidence interval for the average of a normal population with known standard deviation would be (µ1 − z_{α/2}σ1, µ1 + z_{α/2}σ1), where µ1 and σ1 are the values previously calculated.
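The update can be sketched in Python as follows (the a priori values, the known σx and the data are all hypothetical):

```python
import numpy as np

mu0, sigma0 = 7.0, 2.0    # hypothetical a priori distribution N(mu0, sigma0)
sigma_x = 8.0             # assumed known population standard deviation
data = np.array([5.0, 9.0, 6.5, 8.0, 7.5])   # hypothetical new sample
n, x_bar = len(data), data.mean()

# Precision-weighted update of the a priori parameters.
prec0, prec_data = 1 / sigma0 ** 2, n / sigma_x ** 2
mu1 = (prec0 * mu0 + prec_data * x_bar) / (prec0 + prec_data)
sigma1 = np.sqrt(1 / (prec0 + prec_data))

print(mu1, sigma1)        # parameters of the a posteriori distribution N(mu1, sigma1)
```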

Chapter 2

An example of application of inference

2.1 Case of one population


Let us suppose that we have a sample of 25 students from our population, a high school with 558 students. We want to make a study to fulfill 3 goals:

1. One of the things we want to do is to make t-shirts of the high school and sell them to the students to organize a trip. We will use our data to find a confidence interval for the average pocket money of the students, because it can give us an orientation about how high the price can be so that the students can afford it.
2. Recent studies say that young people devote most of their spare time to connecting to the internet and watching television. Can we say that the students of the high school devote more than one hour a day to the internet?
3. We want to check whether the commonly believed figure that 10% of the population is left-handed can be considered true in our population.

We have then data about 25 students referring to the variables mentioned above. The data are
the following:

Observation Pocket money Internet Left-handed
1 0 0 0
2 12 10 0
3 12 10 0
4 5 90 0
5 8 90 0
6 8 0 1
7 0 30 0
8 40 60 0
9 21 0 1
10 0 60 0
11 9 45 0
12 4.5 15 0
13 20 0 0
14 0 30 0
15 15 60 0
16 0 30 0
17 0 0 0
18 0 0 0
19 12 30 1
20 9.4 60 0
21 10 60 1
22 2 120 1
23 5 90 0
24 3.5 150 0
25 10 60 0

Let us solve the problems we have posed. We start with our first goal:
Confidence interval for the average pocket money
We start by searching for the limits between which we can find the average pocket money. The first thing to be done is to fix a confidence level. We fix our confidence level at 90%.
Which situation are we in? We suppose that our population is a normal population. Do we know σ? The answer is no. Therefore, we have a normal population with unknown σ. We recall that the confidence interval for the average in this situation is

$$\left(\bar{x} - t_{\alpha/2}(n-1)\frac{S_c}{\sqrt{n}},\ \bar{x} + t_{\alpha/2}(n-1)\frac{S_c}{\sqrt{n}}\right)$$

for the case of sampling with replacement. As we have sampled without replacement, we apply the correction factor √((N − n)/(N − 1)), and so we have:

$$\left(\bar{x} - t_{\alpha/2}(n-1)\frac{S_c}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}},\ \bar{x} + t_{\alpha/2}(n-1)\frac{S_c}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}\right).$$
Therefore we need the following data:

x̄ = 8.256, Sc = 8.895, t_{α/2}(n − 1) = t_{0.05}(24) = 1.711,

and the interval would be

$$\left(8.256 - 1.711\,\frac{8.895}{\sqrt{25}}\sqrt{\frac{558-25}{558-1}},\ 8.256 + 1.711\,\frac{8.895}{\sqrt{25}}\sqrt{\frac{558-25}{558-1}}\right) = (5.2785, 11.2335).$$

What we get is that appropriate price limits for the t-shirts we want to sell would be 5.27 euros and 11.23 euros.
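This computation can be reproduced with the following Python sketch, using the pocket-money column of the data table above (note that np.std with ddof=1 gives the quasi-standard deviation):

```python
import numpy as np
from scipy import stats

pocket = np.array([0, 12, 12, 5, 8, 8, 0, 40, 21, 0, 9, 4.5, 20, 0, 15,
                   0, 0, 0, 12, 9.4, 10, 2, 5, 3.5, 10])
N, n = 558, len(pocket)

x_bar = pocket.mean()                          # 8.256
s_c = pocket.std(ddof=1)                       # quasi-standard deviation
t_crit = stats.t.ppf(1 - 0.10 / 2, df=n - 1)   # t_{0.05}(24) = 1.711
fpc = np.sqrt((N - n) / (N - 1))               # finite population correction factor

half = t_crit * s_c / np.sqrt(n) * fpc
print((x_bar - half, x_bar + half))            # close to the interval above
```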
Time devoted by young people to the internet
We now ask ourselves whether we can state that the students of the high school devote more than one hour a day to the internet. Which technique can we use to answer our question? We will use a one-sided hypothesis test in which we will try to decide whether the average of our variable is greater than one hour (60 minutes).
Which is our situation now? We suppose again that our population is a normal population and, again, σ is unknown. We now have to choose a confidence level; let it be 95%.
The null and alternative hypotheses for our test are:
H0: the average time devoted daily to the internet is greater than or equal to 60 minutes.
H1: the average time devoted daily to the internet is lower than 60 minutes.
Since σ is unknown, our statistic is

$$T = \frac{\bar{x} - \mu_0}{S_c/\sqrt{n}},$$

and given that

x̄ = 44, Sc = 40.224, tα(n − 1) = t_{0.05}(24) = 1.711,

the statistic takes the following value:

$$T = \frac{44 - 60}{40.224/\sqrt{25}} = -1.9888,$$

and the critical region for the test is

T < t_{1−α}(24) = −1.711.

Therefore, our value is in the critical region, and this means that we should reject the null hypothesis. We cannot state that the students of the high school devote more than an hour a day to the internet.
Proportion of left-handed students of the population
Now we will try to check whether we can say that in our population 10% of the students are left-handed. We will again answer this question using a hypothesis test. As we now have a variable taking only the values 0 and 1, we are not in a normal population; we want to make a test on the parameter p of a binomial distribution.
The null and alternative hypotheses for our test would be:
H0: The proportion of left-handed students is equal to 0.1.
H1: The proportion of left-handed students is not equal to 0.1.
We will make the test at a confidence level of 95%. We recall that the statistic to be used is

$$T = \frac{p - p_0}{\sqrt{p_0(1-p_0)/n}},$$

where

p = 0.2, p0 = 0.1, n = 25.

Therefore the value of our statistic is

$$T = \frac{0.2 - 0.1}{\sqrt{0.1(1-0.1)/25}} = 1.\overline{6},$$

and the critical region for the test is

|T| ≥ z_{α/2} = z_{0.025} = 1.96.

Since 1.67 < 1.96, we cannot reject the hypothesis that in our high school 10% of the students are left-handed.
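The same computation in Python, as a quick check:

```python
import numpy as np
from scipy import stats

p_hat, p0, n = 0.2, 0.1, 25    # 5 left-handed students out of 25

T = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)   # = 1.667
z_crit = stats.norm.ppf(1 - 0.05 / 2)           # z_{0.025} = 1.96

print(T, "reject H0" if abs(T) >= z_crit else "accept H0")
```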

2.2 Case of two populations


Two students of the high school have each taken a sample from their own level, the 5th and the 4th. They have measured, among other things, the heights of the students. Looking at the data, the guy from the 4th level thinks that the students of his level are taller, because his sample average is higher. The guy from the 5th level does not agree: he thinks that what happens is that there is a higher variability in the 5th level population, and that is why the average of the sample from his population is lower. Can we help them find the right answer?
They give us their data, which are the following.
For the 5th level we have

187 161 169 168 170 165 173 160 175 158 175 164 158 161 158 171 175 170 185 158 163 160 169 158 155 168,

while for the 4th level the data are

170 174 164 171 177 163 170 165 160 175 178 174 162 164 170 155 183 176 158 160 160 173 171 152 170.

What we are going to do to find the right answer is to perform two hypothesis tests. In the first one we will pose the hypothesis that what the guy from the 4th level says is right, and later on we will test whether the variance of the variable height of the students of the 5th level is higher than that of the 4th level. We will make all the tests at a confidence level of 95%.
Let us start with the second test; let us see whether we can state that one variance is higher than the other. Our null and alternative hypotheses are:
H0: The variance of the height of the students of the 5th level (σx²) is greater than or equal to that of the students of the 4th level (σy²).
H1: The variance of the height of the students of the 5th level (σx²) is lower than that of the students of the 4th level (σy²).
We are in the case of two normal populations with unknown averages, so our statistic is

$$T = \frac{S_{cx}^2}{S_{cy}^2}.$$

As we have

Scx² = 66.982, Scy² = 58.72,

then

T = 1.14,

and we have the following critical region:

T < F_{1−α}(nx − 1, ny − 1) = 0.50909;

thus we cannot reject that the variance of the 5th level is greater than or equal to that of the 4th. But the guy from the 5th level wants to know whether it is strictly greater, not equal. If we take a look at the region for the two-sided test (σx² = σy²),

T < F_{1−α/2}(nx − 1, ny − 1) = 0.44599 or T > F_{α/2}(nx − 1, ny − 1) = 2.2574,

we come to the conclusion that we cannot reject the hypothesis that the variances are equal, so we cannot assert that the variance of the 5th level is strictly greater.
Let us now make the test for the average. We will suppose that σ is unknown but equal in both populations (the previous test showed that we cannot reject that hypothesis). The null and alternative hypotheses in this case are:
H0: The average height of the students of the 5th level (µx) is lower than or equal to that of the students of the 4th level (µy): µx − µy ≤ 0.
H1: The average height of the students of the 5th level (µx) is greater than that of the students of the 4th level (µy): µx − µy > 0.
In our case, the test statistic is

$$T = \frac{\bar{x} - \bar{y} - a}{\sqrt{\frac{(n_x-1)S_{cx}^2 + (n_y-1)S_{cy}^2}{n_x+n_y-2}}\sqrt{\frac{1}{n_x}+\frac{1}{n_y}}},$$

and, apart from the previous data, we have that

x̄ = 166.692, ȳ = 167.8.

If we substitute,

$$T = \frac{166.692 - 167.8 - 0}{\sqrt{\frac{(26-1)\,66.982 + (25-1)\,58.72}{26+25-2}}\sqrt{\frac{1}{26}+\frac{1}{25}}} = -0.4986.$$

Then the critical region is

T > tα(nx + ny − 2) = 1.6766,

thus we cannot reject the hypothesis H0. And if we pay attention to the two-sided test (µx − µy = 0) and its critical region,

|T| ≥ t_{α/2}(nx + ny − 2) = 2.0096,

we also cannot reject the null hypothesis, so we cannot say that the students of the 5th level and those of the 4th level differ in average height.
Our conclusion is that neither of them is right, at least with these data. The differences between the averages and the variances of the two populations are not significant.
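Both tests can be reproduced with the following Python sketch, using the height data above; scipy's ttest_ind with its default pooled-variance assumption matches the statistic used here:

```python
import numpy as np
from scipy import stats

level5 = np.array([187, 161, 169, 168, 170, 165, 173, 160, 175, 158, 175, 164, 158,
                   161, 158, 171, 175, 170, 185, 158, 163, 160, 169, 158, 155, 168])
level4 = np.array([170, 174, 164, 171, 177, 163, 170, 165, 160, 175, 178, 174, 162,
                   164, 170, 155, 183, 176, 158, 160, 160, 173, 171, 152, 170])

# Test on the variances: T = Scx^2 / Scy^2, compared with the F critical point.
F = level5.var(ddof=1) / level4.var(ddof=1)                   # close to 1.14
f_low = stats.f.ppf(0.05, len(level5) - 1, len(level4) - 1)   # F_{1-alpha}(nx-1, ny-1)
print(F, f_low, "reject H0" if F < f_low else "accept H0")

# Test on the averages: pooled two-sample t statistic (two-sided p-value).
T, p_value = stats.ttest_ind(level5, level4)                  # equal variances assumed
print(T, p_value)
```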
