
Inferential Statistics - Hypothesis Testing & Estimation

By
Alfred Ngwira
Inferential Statistics

• We will make conclusions about population parameters using sample statistics. Specifically, we will be:

1. Testing hypotheses about population parameters using sample statistics

2. Estimating population parameters using sample statistics.
Hypothesis

• A statistical hypothesis is a conjecture/claim about a population parameter (e.g. a population mean or proportion) which may or may not be true. E.g. the proportion of girls at Bunda is 30%.
Hypothesis

• Statistical hypothesis testing is a decision-making process for evaluating claims about a population parameter using the sample.
Hypothesis
Examples

1. The mean temperature at Salima town is less than 35°C.

2. The mean grade point average of graduating students at a university is at least 2.3.

3. The mean income for LUANAR graduates when employed is MK150,000 per month.
Types of hypothesis

Null hypothesis

• Symbolized by Ho, it is a statistical hypothesis stating that there is no difference between a parameter and a specific value, or that there is no difference between two parameters.

• The null hypothesis contains an equal sign. E.g. 2 and 3 on the previous slide are null hypotheses.
Types of hypothesis

Alternative hypothesis
• Symbolized by H1, it is a statistical hypothesis stating that there is a difference between a parameter and a specific value, or that there is a difference between two parameters.
Types of hypothesis
• The alternative hypothesis usually contains the symbol >, <, or ≠.

• E.g. 1 on the previous slide is an alternative hypothesis.
Hypothesis testing
procedure

Step 1: Identify H0 and H1

H0 will contain =

H1 will contain >, <, or ≠

Step 2: Select the test statistic, e.g. z, t, F, or chi-square, based on the distribution of the sample statistic.
Hypothesis testing Procedure

Step 3: Use a given level of significance (α, the Type I error probability) to determine the critical/rejection region(s); use α to find the critical value/point from statistical tables.
Hypothesis testing Procedure

• E.g. for a right tailed t test, the rejection region is the area α in the right tail of the t distribution.
Hypothesis testing
procedure
Step 4: Calculate the test statistic from
sample data

Step 5: Make your decision. If the test


statistic falls in the critical region, reject H0.
If the test statistic does not fall in the
critical region, do not reject H0. Interpret
your decision in terms of the claim.
Hypothesis testing

Types of errors in conclusions made in hypothesis testing
• Type I error (α): the error/probability of rejecting Ho when it should not be rejected.

• Type II error (β): the error/probability of failing to reject ('accepting') Ho when it is false.
Hypothesis about the mean

• The following are possible hypothesis formulations about the mean:

1) Ho: μ = μ0 vs H1: μ ≠ μ0
2) Ho: μ = μ0 vs H1: μ > μ0
3) Ho: μ = μ0 vs H1: μ < μ0

• Note: μ = population mean, μ0 = specific hypothesised value.
Hypothesis about the mean
• The first is a two tailed, the 2nd a right tailed and the 3rd a left tailed hypothesis formulation.

• Two tailed because there are two directions of the alternative (right/left).

• Right tailed because the direction of the alternative is to the right, and left tailed because the direction of the alternative is to the left.
Hypothesis about mean

Example of two tailed hypothesis formulation

Ho: Average burley tobacco yield in 2014 was 50 000 000 kg

H1: Average burley tobacco yield in 2014 was not 50 000 000 kg
Hypothesis about mean

Example of right tailed hypothesis formulation

Ho: Average burley tobacco yield in 2014 was 50 000 000 kg

H1: Average burley tobacco yield in 2014 was more than 50 000 000 kg
Hypothesis about mean

Example of left tailed hypothesis formulation

Ho: Average burley tobacco yield in 2014 was 50 000 000 kg

H1: Average burley tobacco yield in 2014 was less than 50 000 000 kg
Hypothesis about the mean

• Note: the direction of the alternative hypothesis determines the type of hypothesis test, i.e. whether it is two tailed, right tailed or left tailed.

• Two tailed means that when testing such a hypothesis formulation there will be two rejection regions (to the right and left of the distribution of the test statistic).
Hypothesis about the mean

• E.g. if we use a z test for a two tailed hypothesis, the two rejection regions lie in the two tails of the z distribution.
Hypothesis about mean

• A right tailed hypothesis test has a rejection region only to the right of the distribution of the test statistic.
Hypothesis about mean

• A left tailed hypothesis test has a rejection region to the left of the distribution of the test statistic.
Hypothesis about mean
Activity

Consider the hypothesis formulation below:

Ho: Mean maize yield per hectare was 8 bags in 2015/2016 versus

H1: Mean maize yield per hectare was less than 8 bags
Hypothesis about mean

1. Determine whether the test of hypothesis


will be left tailed/right tailed/two tailed.

2. Determine direction of rejection

region(s)
Hypothesis about the mean

• Note that the critical points marking the rejection regions in a two tailed test are based on α/2, while the critical point marking the rejection region in a one tailed test is based on the whole α.
Hypothesis about the mean
Activity
Mark T/F
1. In a two tailed test there are two rejection regions.
2. To get the critical values marking the rejection regions in a two tailed test we use alpha (α) divided by two.
3. In a one tailed test we get the critical value marking the rejection region by using alpha (α).
Hypothesis about the mean

• The sample statistic used to test such hypotheses is the sample mean (X̄):

1. If sampling from a normal population with known population standard deviation σ, use the z-test (1).

2. If the sample size is large, i.e. n ≥ 30, the sample is from any population, and σ is estimated by the sample standard deviation S, use the z-test (2).
Hypothesis about the mean

3. If sampling from a normal population, the sample size is small, i.e. n < 30, and the population standard deviation is estimated by the sample standard deviation, use the t-test (3) with n-1 degrees of freedom.

1. z = (X̄ - μ0)/(σ/√n)    2. z = (X̄ - μ0)/(S/√n)    3. t = (X̄ - μ0)/(S/√n)
Hypothesis about the mean
Example
 Full-time PhD students receive an average
salary of MK12,837 according to the
Department of Education. The dean of
graduate studies at the university feels that
PhD students earn more than this.
Hypothesis about the mean

• He selects 44 students randomly and finds their average salary is MK14,445 with a standard deviation of MK1,500. With α = 0.05, is the dean correct?
Testing hypothesis about mean
Solution
Step 1 Stating null and alternative hypothesis:
Ho: μ = MK12,837
H1: μ > MK12,837
– Thus we have a right tailed test. Our rejection region is to the right. Ho will be rejected if the sample mean is far to the right.
Hypothesis about the mean
Step 2: The sample statistic is the sample mean X̄, and its distribution is approximately normal since the sample size is greater than 30; so we use the z test statistic Z = (X̄ - μ0)/(S/√n) after standardization.
Hypothesis test about the mean

Step 3: Critical value: we reject Ho when z ≥ z_α, i.e. when z ≥ z0.05 = 1.65
Hypothesis about the mean

Step 4: Calculating the statistic using the sample data,

Z = (X̄ - μ0)/(S/√n) = (14445 - 12837)/(1500/√44) ≈ 7.11

Step 5: Conclusion: since z > 1.65, we reject Ho and adopt the alternative H1, i.e. PhD students earn more than MK12,837 based on the available data.
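A minimal sketch in Python of this right tailed z test, assuming scipy is available (values taken from the example above):

from math import sqrt
from scipy.stats import norm

x_bar, mu0, s, n, alpha = 14445, 12837, 1500, 44, 0.05
z = (x_bar - mu0) / (s / sqrt(n))    # test statistic, about 7.11
z_crit = norm.ppf(1 - alpha)         # right tailed critical value, about 1.645 (1.65 in the slides)
p_value = norm.sf(z)                 # P(Z >= z), far below 0.05
print("Reject Ho" if z >= z_crit else "Fail to reject Ho")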
Hypothesis about the population
mean
Example

A nutritionist believes a 12 g box of breakfast cereal contains an average of 1.2 g of bran. The nutritionist takes a random sample of sixty boxes of a popular cereal. She finds a sample mean of 1.170 g and a standard deviation of s = 0.111 g.
Hypothesis about the mean
Do the data indicate that the mean bran content of all boxes of this brand of cereal differs from 1.2 g? Use α = 0.05.

Solution
Step 1: Stating the null and alternative hypothesis:

Ho: μ = 1.2
H1: μ ≠ 1.2
Hypothesis test about mean
Note that we have a two tailed test.

Step 2: The sample statistic is X̄ and it is approximately normal since n > 30; its standard form is Z = (X̄ - μ0)/(S/√n)

Step 3: Rejection criteria: we reject Ho when |z| ≥ z(α/2) = z0.025 = 1.96, i.e. when Z ≥ 1.96 or Z ≤ -1.96
Hypothesis test about mean
 i.e when z is in the right rejection region or
left
Hypothesis test about mean
Step 4: Now Z = (X̄ - μ0)/(S/√n) = (1.170 - 1.2)/(0.111/√60) ≈ -2.09

Step 5: Conclusion: since z < -1.96, we reject the null, Ho, i.e. the mean bran content is different from 1.2 g.
Hypothesis test about mean

• Review
– Alpha (α) is the Type I error: the probability of rejecting the null when it should not be rejected.

– Alpha (α) is the area of the rejection region.
Hypothesis about the mean
 Review
Hypothesis test about mean

Review

• For a right tailed test, α = P(Z ≥ critical value/point) if using a Z test, or α = P(t ≥ critical value) if using a t test.

• For a left tailed test, α = P(Z ≤ -critical value) for a Z test.
Hypothesis test about mean

Review

• For a left tailed test, α = P(Z ≤ -critical value) for a Z test and α = P(t ≤ -critical value) for a t test.
Hypothesis test about the mean

Review on alpha (α)
Hypothesis test about mean - Review
• For a two tailed test there are two rejection regions (right/left).

• The total sum of the areas of the two regions is α.

• That is, each rejection region = α/2 in area or probability.
Hypothesis test about mean
Review
• The two rejection regions together are alpha (α) in area.
Hypothesis test about mean

Review

 For a two tailed test, if you find a positive


test statistic compare it with positive
critical value and reject Ho if statistic is ≥
positive critical, otherwise fail to reject Ho
Hypothesis test about mean

 If you find a negative statistic, compare it


with negative critical value, and reject Ho if

statistic is ≤ negative critical, otherwise fail


to reject Ho(see e.g before this review)
Hypothesis test about mean

Example

The average rainfall during the summer months for the southern region of Malawi is 11.52 mm. A researcher selects a random sample of 10 districts in southern Malawi and finds that the average amount of rainfall for 2014 is 7.42 mm.
Hypothesis test about mean

The standard deviation of the sample is 1.3 mm. At α = 0.05, can it be concluded that for 2014 the mean rainfall was below 11.52 mm?
Solution
Step 1: Ho: μ = 11.52
        H1: μ < 11.52
Hypothesis about mean

Step 2: Since n < 30, the test statistic is t = (X̄ - μ0)/(S/√n) with n-1 degrees of freedom

Step 3: Critical value: we reject Ho when t ≤ -t(α, n-1), i.e. when t ≤ -t0.05,9 = -1.833

Hypothesis about mean


Step 4: Now t = (X̄ - μ0)/(S/√n) = (7.42 - 11.52)/(1.3/√10) ≈ -9.97

Step 5: Conclusion: since t is less than the critical value (-9.97 < -1.833), we reject Ho and adopt H1, i.e. the mean rainfall is below 11.52 mm.
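A minimal sketch in Python of this left tailed t test, assuming scipy is available (values taken from the example above):

from math import sqrt
from scipy.stats import t as t_dist

x_bar, mu0, s, n, alpha = 7.42, 11.52, 1.3, 10, 0.05
df = n - 1
t_stat = (x_bar - mu0) / (s / sqrt(n))   # about -9.97
t_crit = t_dist.ppf(alpha, df)           # left tailed critical value, about -1.833
print("Reject Ho" if t_stat <= t_crit else "Fail to reject Ho")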
Hypothesis about difference
between two means
• Possible hypothesis formulations:

1) Ho: μ1 = μ2 or μ1 - μ2 = 0
   H1: μ1 ≠ μ2 or μ1 - μ2 ≠ 0

2) Ho: μ1 = μ2 or μ1 - μ2 = 0
   H1: μ1 > μ2 or μ1 - μ2 > 0
Hypothesis about difference
between two means
3) Ho: μ1 = μ2 or μ1 - μ2 = 0
   H1: μ1 < μ2 or μ1 - μ2 < 0

• Note: (1) is a two tailed test while (2) & (3) are one tailed test hypothesis formulations.
Hypothesis about difference
between two means
• Now the appropriate statistic is the difference between the sample means, X̄1 - X̄2.

• If we sample from normal populations or n1, n2 ≥ 30, X̄1 - X̄2 is also normal, and thus by standardization

Z = [X̄1 - X̄2 - (μ1 - μ2)] / se(X̄1 - X̄2) = [X̄1 - X̄2 - (μ1 - μ2)] / √(σ1²/n1 + σ2²/n2) = (X̄1 - X̄2) / √(σ1²/n1 + σ2²/n2)

(the last form applies under Ho: μ1 - μ2 = 0).
Hypothesis about difference
between two means
• Note: if n1, n2 ≥ 30 and we don't know the population variances σ1², σ2², we can use the sample variances S1², S2² and still use the z-test statistic:

Z = (X̄1 - X̄2) / √(S1²/n1 + S2²/n2)
Hypothesis about difference
between two means
• Note: if n1, n2 < 30 and we don't know the population variances σ1², σ2², we can use the sample variances S1², S2² and use the t-test, with the smaller of n1 - 1 and n2 - 1 as the degrees of freedom:

t = (X̄1 - X̄2) / √(S1²/n1 + S2²/n2)
Hypothesis about difference
between two means
• Note that the Z and t tests just defined for the difference between two means assume that the two population variances are not the same, i.e. σ1² ≠ σ2².

• If we assume the population variances are equal, σ1² = σ2² = σ², then we have the Z statistic as:
Hypothesis about the difference
between two means
• Under the equal population variance assumption, i.e. σ1² = σ2² = σ²,

Z = (X̄1 - X̄2) / √(σ1²/n1 + σ2²/n2)
  = (X̄1 - X̄2) / √(σ²/n1 + σ²/n2)
  = (X̄1 - X̄2) / √(σ²(1/n1 + 1/n2))
  = (X̄1 - X̄2) / [σ√(1/n1 + 1/n2)]
Hypothesis about the difference
between two means
• If n1, n2 ≥ 30 and σ1², σ2² are estimated by S1², S2², then the Z statistic is

Z = (X̄1 - X̄2) / [S√(1/n1 + 1/n2)]

where S = √[ (S1²(n1 - 1) + S2²(n2 - 1)) / (n1 + n2 - 2) ] is the pooled sample standard deviation.
Hypothesis about difference
between two means
• If the sample sizes are small and we assume that the population variances are the same, σ1² = σ2² = σ², then the t statistic is

t = (X̄1 - X̄2) / [S√(1/n1 + 1/n2)]

with n1 + n2 - 2 degrees of freedom, where S = √[ (S1²(n1 - 1) + S2²(n2 - 1)) / (n1 + n2 - 2) ] is the pooled sample standard deviation.
Hypothesis about difference
between two means
Example: Two types of fertilizers, UREA and CAN, were applied to two maize plots respectively. Farmers think that there is no difference in maize yield between the two fertilizers. A researcher takes a sample of 40 maize grains in plot 1 and 32 maize grains in plot 2.
Hypothesis about difference
between two means
• The average weight of the maize grains is 10 kg in plot 1 and 7 kg in plot 2. The standard deviation of the weights is 2 kg for plot 1 and 4 kg for plot 2. Test whether there is a difference in maize yield between UREA and CAN. Assume that the population variances are not equal.
Hypothesis about difference
between two means
Solution
Step 1: Ho: μ1 = μ2
        H1: μ1 ≠ μ2

Step 2: Since n1, n2 ≥ 30 we use Z = (X̄1 - X̄2) / √(S1²/n1 + S2²/n2)
Hypothesis about difference
between two means
Step 3: Critical value: we reject Ho if |Z| ≥ Z(α/2) = Z0.025 = 1.96, i.e. when Z ≥ 1.96 or Z ≤ -1.96

Step 4: Now Z = (X̄1 - X̄2) / √(S1²/n1 + S2²/n2) = (10 - 7) / √(2²/40 + 4²/32) ≈ 3.87

Step 5: Since Z > Z(α/2) = 1.96, we reject Ho, i.e. there is a difference in mean yield between UREA and CAN.
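A minimal sketch in Python of this large-sample two-sample z test, assuming scipy is available (summary values from the example above):

from math import sqrt
from scipy.stats import norm

x1, s1, n1 = 10, 2, 40   # plot 1 (UREA)
x2, s2, n2 = 7, 4, 32    # plot 2 (CAN)
alpha = 0.05
z = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)   # about 3.87
z_crit = norm.ppf(1 - alpha / 2)                # 1.96
print("Reject Ho" if abs(z) >= z_crit else "Fail to reject Ho")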


Hypothesis about difference
between two means
Example

A farmer thinks that local chickens lay eggs with larger weight than hybrid chickens. She collects 10 eggs from local chickens and 12 eggs from hybrid chickens.
Hypothesis about difference
between two means
• The mean weight for local is found to be 5 kg and that for hybrid is found to be 12 kg. The standard deviation for local is 2 kg and that for hybrid is 3 kg. Test the farmer's claim (use α = 0.05).
Hypothesis about difference
between two means
Solution
Data: X̄1 = 5, X̄2 = 12, S1 = 2, S2 = 3, n1 = 10, n2 = 12
Step 1: Hypothesis

Ho: μ1 = μ2
H1: μ1 > μ2
Hypothesis about difference
between two means
Step 2: Since the sample sizes are less than 30, we use the t test statistic

t = (X̄1 - X̄2) / √(S1²/n1 + S2²/n2)

with degrees of freedom equal to the smaller of n1 - 1 and n2 - 1.
Hypothesis about difference
between two means
• Note: here we assume that the two population variances are not equal.

Step 3: Critical value: we reject Ho if t ≥ t(α, df), where df is the smaller of n1 - 1 and n2 - 1, i.e. reject Ho when t ≥ t0.05,9 = 1.833


Hypothesis about difference
between two means
Step 4: Now t = (X̄1 - X̄2) / √(S1²/n1 + S2²/n2) = (5 - 12) / √(2²/10 + 3²/12) ≈ -6.53

Step 5: Conclusion: since t < t0.05,9 = 1.833, we fail to reject the null hypothesis, i.e. based on the available data μ1 ≤ μ2 (the data give no evidence that local eggs are heavier).
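A minimal sketch in Python of this small-sample test with unequal variances and the conservative df = smaller of n1-1 and n2-1, assuming scipy is available:

from math import sqrt
from scipy.stats import t as t_dist

x1, s1, n1 = 5, 2, 10    # local
x2, s2, n2 = 12, 3, 12   # hybrid
alpha = 0.05
df = min(n1 - 1, n2 - 1)                             # 9
t_stat = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)   # about -6.53
t_crit = t_dist.ppf(1 - alpha, df)                   # right tailed critical value, 1.833
print("Reject Ho" if t_stat >= t_crit else "Fail to reject Ho")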
Hypothesis about difference
between two means
Example

Is there a difference in book return times between students of the two universities below? Use α = 0.05.

LUANAR: 2, 4.3, 8.5, 3, 2

Mzuzu: 3, 6.5, 5, 7.5, 8, 4, 3

Assume the population variances are the same.


Hypothesis about difference
between two means
Step 1: Ho: μ1 = μ2 versus H1: μ1 ≠ μ2 (two tailed)

Step 2: The test statistic, under the assumption of equal population variances and since n1, n2 < 30, is

t = (X̄1 - X̄2) / [S√(1/n1 + 1/n2)]

with n1 + n2 - 2 degrees of freedom, where S = √[ ((n1 - 1)S1² + (n2 - 1)S2²) / (n1 + n2 - 2) ] is the pooled sample standard deviation.
Hypothesis about difference
between two means
Step 3: Rejection criteria: Ho is rejected when t ≥ t(α/2, n1+n2-2) = t0.025,10 = 2.228 or t ≤ -t0.025,10 = -2.228
Hypothesis about difference
between two means
Step 4: Test statistic calculation

S = √[ ((5 - 1)(7.33) + (7 - 1)(4.32)) / (5 + 7 - 2) ] = 2.351

t = (x̄1 - x̄2) / [S√(1/n1 + 1/n2)] = (3.96 - 5.29) / [2.351√(1/5 + 1/7)] = -1.33/1.377 ≈ -0.97
Hypothesis about difference
between two means
Step 5: Conclusion: since |t| < t0.025,10 = 2.228, we fail to reject Ho, i.e. there is no difference in book return times between the two universities.
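A minimal sketch in Python of this pooled-variance t test on the raw data, using scipy's ttest_ind (assumed available):

from scipy.stats import ttest_ind

luanar = [2, 4.3, 8.5, 3, 2]
mzuzu = [3, 6.5, 5, 7.5, 8, 4, 3]
t_stat, p_value = ttest_ind(luanar, mzuzu, equal_var=True)   # pooled-variance t test, 10 df
print(t_stat, p_value)   # t about -0.96, p well above 0.05, so fail to reject Ho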


Hypothesis about the proportion

• The population proportion is the ratio of items of interest to the total.

• Examples
– Ratio of males to the total in a statistics class (400/635)

– Ratio of extension students to the total (200/635)


Hypothesis about population
proportion
• We denote the population proportion by P = R/N, where R is the number of items of interest and N is the total population size, and we denote the sample proportion by P̂ = r/n.

• Note: r is the number of items of interest in the sample of n items/individuals.
Hypothesis about proportion

• To test a hypothesis about the population proportion P = R/N we use the sample proportion P̂ = r/n.

• Note that the mean/expected value of P̂ = r/n is E(P̂) = P and its variance is V(P̂) = P(1 - P)/n.
Hypothesis about population
proportion
• Now for large samples, i.e. n ≥ 30, the sampling distribution of the sample proportion P̂ = r/n is approximately normal with mean P and variance P(1 - P)/n, so that by standardization we have

Z = (P̂ - P) / √(P(1 - P)/n)
Hypothesis about population
proportion
• That is, to test a hypothesis about the population proportion we will assume large samples so as to use the z-test

Z = (P̂ - P) / √(P(1 - P)/n)
Hypothesis about population
proportion
Example

An ABM marketing company claims that it receives a 4% response rate from its mailings. To test this claim, a random sample of 500 was surveyed, with 25 responses. Test at the α = 0.05 significance level.

Hypothesis about population
proportion
H0: p = 0.04

H1: p ≠ 0.04

• This is a two-sided (two tailed) rejection region test.

• The appropriate test statistic is Z = (P̂ - P) / √(P(1 - P)/n)
Hypothesis about population
proportion
• Now we reject Ho if Z ≤ -Z(α/2) or Z ≥ Z(α/2), i.e. when Z ≤ -Z0.025 or Z ≥ Z0.025, i.e. when Z ≤ -1.96 or Z ≥ 1.96.

• Now Z = (P̂ - P) / √(P(1 - P)/n) = (0.05 - 0.04) / √(0.04(1 - 0.04)/500) ≈ 1.14
Hypothesis about population
proportion
• Conclusion: since z = 1.14 < 1.96, we fail to reject the null hypothesis, i.e. the claim of the ABM marketing company cannot be rejected on the available data.

• Note: we used the Z-test since n ≥ 30 (by the CLT).
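A minimal sketch in Python of this one-sample proportion z test, assuming scipy is available:

from math import sqrt
from scipy.stats import norm

p0, n, successes, alpha = 0.04, 500, 25, 0.05
p_hat = successes / n                        # 0.05
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # about 1.14
z_crit = norm.ppf(1 - alpha / 2)             # 1.96
print("Reject Ho" if abs(z) >= z_crit else "Fail to reject Ho")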
Hypothesis about difference
between population proportions
• Possible hypothesis formulations:

(1) Ho: P1 = P2 or Ho: P1 - P2 = 0
    H1: P1 ≠ P2    H1: P1 - P2 ≠ 0

(2) Ho: P1 = P2 or Ho: P1 - P2 = 0
    H1: P1 > P2    H1: P1 - P2 > 0
Hypothesis about difference
between population proportions
(3) Ho: P1 = P2 or Ho: P1 - P2 = 0
    H1: P1 < P2    H1: P1 - P2 < 0

• The appropriate statistic is the difference between the sample proportions, i.e. P̂1 - P̂2.
Hypothesis about difference
between population proportions
• Note: if the samples are large, i.e. n1, n2 ≥ 30, then P̂1 - P̂2 is approximately normal with mean P1 - P2 and variance P1(1 - P1)/n1 + P2(1 - P2)/n2.

• That is, standardising P̂1 - P̂2 we have the z-test

Z = [(P̂1 - P̂2) - (P1 - P2)] / √(P1(1 - P1)/n1 + P2(1 - P2)/n2) = (P̂1 - P̂2) / √(P1(1 - P1)/n1 + P2(1 - P2)/n2)

(the last form applies under Ho: P1 - P2 = 0).
Hypothesis about difference
between two proportions
• Under the assumption of equal population proportions, i.e. P1 = P2 = P, we have

Z = (P̂1 - P̂2) / √(P(1 - P)(1/n1 + 1/n2))

where P = (n1P̂1 + n2P̂2)/(n1 + n2) is the pooled sample proportion based on the two sample proportions.
Hypothesis about difference
between proportions
Example

A farmers' club in Mzuzu claims that the proportion of rotten groundnuts in their 50 kg bag is the same as that of Mulli Brothers Limited.
Hypothesis about difference
between population proportions
A researcher collects 100 groundnuts from a bag of the farmers' club and finds that 20% are rotten, and collects 80 from a Mulli bag and finds that 12% are rotten. Test the claim of the farmers' club (use α = 0.05).
Hypothesis about difference
between proportions
• Data: P̂1 = 0.20, P̂2 = 0.12, n1 = 100, n2 = 80
• Hypothesis: Ho: P1 = P2
              H1: P1 ≠ P2
• The test statistic is Z = (P̂1 - P̂2) / √(P(1 - P)(1/n1 + 1/n2))
Hypothesis about difference
between population proportions
• Rejection criterion: reject Ho if |Z| ≥ Z0.025 = 1.96, i.e. when Z ≥ 1.96 or Z ≤ -1.96

• Now calculating the test statistic:

P = (n1P̂1 + n2P̂2)/(n1 + n2) = (100 × 0.20 + 80 × 0.12)/(100 + 80) ≈ 0.16

Z = (P̂1 - P̂2) / √(P(1 - P)(1/n1 + 1/n2)) = (0.20 - 0.12) / √(0.16(1 - 0.16)(1/100 + 1/80)) ≈ 1.44
Hypothesis about difference
between population proportions
• Since Z = 1.44 < 1.96, we fail to reject the null hypothesis, i.e. based on the available data there is no difference in the proportions of rotten groundnuts between the bags of the farmers' club and those of Mulli.
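A minimal sketch in Python of this two-proportion z test with the pooled proportion, assuming scipy is available:

from math import sqrt
from scipy.stats import norm

p1_hat, n1 = 0.20, 100   # farmers' club bag
p2_hat, n2 = 0.12, 80    # Mulli bag
alpha = 0.05
p_pool = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)        # about 0.164
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se                              # about 1.44
z_crit = norm.ppf(1 - alpha / 2)                        # 1.96
print("Reject Ho" if abs(z) >= z_crit else "Fail to reject Ho")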
One way anova/comparing
more than two means
• Data layout

Treatments/groups   Observations
1                   y11 y12 y13 … y1n1
2                   y21 y22 y23 … y2n2
.                   .
.                   .
.                   .
k                   yk1 yk2 yk3 … yknk
One way anova/comparing
more than two means
• The appropriate statistic is F = MST/MSE ~ F(k-1, N-k), where

MST = Σ_j n_j(x̄_j - x̄)² / (k - 1),  j = 1, 2, …, k

and where n_j is the sample size for group j, x̄_j is the sample mean for group j, and x̄ is the grand/overall mean.
One way anova/comparing
more than two means
• The mean square error (MSE), or within group variation, is

MSE = Σ_j (n_j - 1)S_j² / (N - k) = Σ_j Σ_i (x_ji - x̄_j)² / (N - k)

where S_j² is the variance of group j and x_ji is observation i in group j.
One way anova/comparing
more than two means
• The null hypothesis is rejected when this statistic is greater than or equal to the critical F-value, i.e. when F ≥ F(α; k-1, N-k), or when the p-value P(F) ≤ α.
One way anova/comparing
more than two means
 Rejection criteria
One way anova/comparing
more than two means
Example
The following data are the weights in kg of patients after being given three diets.

Weight in kg
Diet 1: 210 215 205 180 175 190
Diet 2: 180 160 195 190 170 155

One way anova/comparing
more than two means
Step 1: Hypothesis
Ho: no difference among diets (µ1 = µ2 = … = µk)
H1: there is a difference among diets (at least two diet means are different)

Step 2: The statistic to use is F = MST/MSE ~ F(k-1, N-k)
One way anova/comparing
more than two means
Step 3: Critical value: Ho is rejected when F ≥ F(α; k-1, N-k) = F(0.05; 2, 15) = 3.68
One way anova/comparing
more than two means
Step 4: Calculating the statistic

n1 = n2 = n3 = 6, N = n1 + n2 + n3 = 18

x̄1 = 195.83, x̄2 = 175, x̄3 = 161.6, x̄ = 177.5


One way anova/comparing
more than two means
Calculating the MST we have

MST = Σ_j n_j(x̄_j - x̄)² / (k - 1)
    = [6(195.8 - 177.5)² + 6(175 - 177.5)² + 6(161.6 - 177.5)²] / (3 - 1)
    ≈ 1779.4
One way anova/comparing
more than two means
Calculating the F statistic (with MSE = 216.91 from the within group variation) we have

F = MST/MSE = 1779.4/216.91 ≈ 8.2

Step 5: Conclusion: since the calculated F = 8.2 > the critical value 3.68, we reject the null hypothesis, i.e. there is a difference among the treatment means.
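A minimal sketch in Python of the F-test decision, using the summary values above and scipy (assumed available) for the critical value:

from scipy.stats import f

mst, mse = 1779.4, 216.91     # summary values from the example
k, N, alpha = 3, 18, 0.05
F = mst / mse                               # about 8.2
F_crit = f.ppf(1 - alpha, k - 1, N - k)     # F(0.05; 2, 15), about 3.68
print("Reject Ho" if F >= F_crit else "Fail to reject Ho")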
One way anova/comparing
more than two means
• The F test to compare means is based on an analysis of variance (ANOVA) of the yij into different sources, i.e. due to group and due to error:

Total variation = between group variation + error variation (within group)
One way anova/comparing
more than two means
• Now the total variability in the yij is measured by TSS, between group variability by SST, and error variability by SSE. Thus the analysis of variance of the data yij is summarized by

TSS = SST + SSE

• The degrees of freedom are N-1 for TSS, k-1 for SST, and N-k for SSE.
One way anova/comparing
more than two means
• Now the F test to compare group means compares variability due to group and variability due to error by the ratio

F = [SST/(k - 1)] / [SSE/(N - k)] = MST/MSE ~ F(k-1, N-k)
One way anova/comparing
more than two means
• MST is the mean square of the between group sum of squares, and MSE is the mean square of the error sum of squares.

• MST is a measure of the variability in the data due to group differences and MSE is a measure of the within group data variability.
One way anova/comparing
more than two means
• Ho is rejected when F ≥ F(α; k-1, N-k) or when the p-value P(F) ≤ α, where α is the Type I error, a.k.a. the significance level.
One way anova/comparing
more than two means
• Summary of the one way anova table and F statistic

Source of variation   Sum of squares (SS)   Degrees of freedom (df)   Mean square (MS)    F-value
Group/treatment       SST                   k-1                       MST = SST/(k-1)     F = MST/MSE
Error                 SSE                   N-k                       MSE = SSE/(N-k)
Total                 TSS                   N-1

• Note: TSS = SST + SSE, and N-1 = (k-1) + (N-k)


One way anova/comparing
more than two means
Example

Complete the ANOVA table below and test whether the group means are equal or not.

ANOVA
Source   SS         df    MS       F-value
Groups   291.8027   (b)_  145.90   (c)_
Error    (a)_       50
Total    785.1908   52
One way anova/comparing
more than two means
Solution
ANOVA
Source   SS             df      MS      F-value
Groups   291.8027       (b) 2   145.90  (c) 14.78
Error    (a) 493.3881   50      9.87
Total    785.1908       52
One way anova/comparing
more than two means
a) Subtract the group sum of squares from the total, since TSS = SST + SSE

b) Subtract the error df from the total df

c) Use the formula: F = MST/MSE = 145.90/9.87 = 14.78
One way anova/comparing
more than two means
• Critical F-value: F(α; k-1, N-k) = F(0.05; 2, 50) = 3.183

• Now since F = 14.78 > F(0.05; 2, 50) = 3.183, we reject the null hypothesis, i.e. the treatment/group means are different.


Testing for association in
cross tables
 Let X1 and X2 be categorical variables
with I and J categories/levels respectively
in the cross table/contingency table
Testing for association in
cross tables
• Cross table of two categorical variables

                X2
       Level 1  Level 2  Level 3  …  Level J
   Level 1
X1 Level 2
   .
   Level I
Testing for association in
cross tables
 We wish to test whether there is an
association between X1 and X2. The
following is the hypothesis formulation:

Ho: There is no association

H1: There is an association


Testing for association in
cross tables
 Alternatively you may state the
hypotheses as follows:

Ho: X1 and X2 are independent

H1: X1 and X2 are dependent


Testing for association in
cross tables
• One of the statistics to use is the Pearson chi-square, defined as

χ² = Σ (O - E)²/E ~ χ²((I-1)(J-1))

• The null, Ho, is rejected when χ² ≥ χ²((I-1)(J-1), α)
Testing for association in
cross tables
Example

Test whether there is an association between heart attack status and personality type.

Heart Attack Status   Type A   Type B
Heart Attack          O=25     O=10
No Heart Attack       O=5      O=40


Testing for association in
cross tables
Solution
• From a mere observation it seems there is an association between heart attack and personality (a positive association). But we test whether the association is significant (i.e. real, not due to chance).
Testing for association in
cross tables
• To test for a significant association we use the Pearson Chi-square test.

• Now to compute the chi-square, we first compute the expected frequencies (E), i.e.

E = (cell's column total) × (cell's row total) / grand total
Testing for association in
cross tables
                  Personality Type
                  A      B      Row total
Heart Attack      O=25   O=10   35
No Heart Attack   O=5    O=40   45
Column total      30     50     Grand total = 80

Testing for association in
cross tables
 E: type A and heart attack:
(30)(35)/80 = 13.125
 E: type A and no heart attack:
(30)(45)/80 =16.875
 E: type B and heart attack:
(50)(35)/80 = 21.875
 E: type B and no heart attack:
(50)(45)/80 = 28.125
Testing for association in
cross tables
• Putting the expected information in the table we have

                  Personality Type
                  A                B                Row total
Heart Attack      O=25, E=13.125   O=10, E=21.875   35
No Heart Attack   O=5, E=16.875    O=40, E=28.125   45
Column total      30               50               Grand total = 80
Testing for association in
cross tables
• Now to compute the chi-square we use:

χ² = Σ (O - E)²/E
Testing for association in
cross tables
• Thus we have:

χ² = (25 - 13.125)²/13.125 + (10 - 21.875)²/21.875 + (5 - 16.875)²/16.875 + (40 - 28.125)²/28.125
   ≈ 10.74 + 6.45 + 8.36 + 5.01
   ≈ 30.56
Testing for association in
cross tables
• We reject Ho when χ² ≥ χ²(1, 0.05) = 3.84

• Now our obtained χ² of about 30.56 exceeds this value.

• We reject H0, i.e. there is an association between heart attack and personality type.
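A minimal sketch in Python of the same test using scipy's chi-square test of independence (assumed available):

import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[25, 10],    # heart attack:    type A, type B
                  [ 5, 40]])   # no heart attack: type A, type B
chi2, p, dof, expected = chi2_contingency(table, correction=False)  # no Yates correction,
                                                                    # to match the hand calculation
print(chi2, dof, p)   # chi-square about 30.6 on 1 df, p far below 0.05
print(expected)       # same expected counts as computed above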
Testing for association in
cross tables
• Assumptions of the Pearson Chi-square Test
– All expected frequencies should be ≥ 2 (observed frequencies can be < 2).

– No more than 20% of the expected frequencies should be < 5.


Testing for association in
cross tables
• Special Consideration
– If the expected frequencies in the cells are "too small," the Pearson χ² test may not be valid; use the Fisher exact test instead. You can read about the Fisher exact test.
Chi-square goodness of fit
• Used to test whether an observed frequency distribution agrees with an expected/theoretical frequency distribution.

Example: Farmers' participation in tobacco farming

Response                          Percent
1. Yes—currently participate      29%
2. Yes—participated in the past   39%
3. No—have never participated     32%

Chi-square goodness of fit
Suppose a current survey of n = 200 farmers indicates the following responses.

Response   1               2               3               Total
Observed   82              64              54              200
Expected   0.29×200 = 58   0.39×200 = 78   0.32×200 = 64   200
Chi-square goodness of fit

Solution:

Ho: The observed frequencies have the same distribution as the previous (expected) distribution

H1: They have a different distribution


Chi-square goodness of fit
• Calculating the Chi-square we have:

χ² = Σ (O - E)²/E
   = (82 - 58)²/58 + (64 - 78)²/78 + (54 - 64)²/64
   = 9.93 + 2.51 + 1.56
   ≈ 14.0
Chi-square goodness of fit

• The critical value is χ²(α, df) = χ²(0.05, 2) = 5.991

• Now since the calculated chi-square is greater than 5.991, we reject the null hypothesis, i.e. the observed frequencies do not follow the expected distribution.
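A minimal sketch in Python of this goodness-of-fit test, assuming scipy is available:

from scipy.stats import chisquare, chi2

observed = [82, 64, 54]
expected = [0.29 * 200, 0.39 * 200, 0.32 * 200]   # 58, 78, 64
stat, p = chisquare(observed, f_exp=expected)     # statistic about 14.0
crit = chi2.ppf(0.95, df=2)                       # about 5.991
print("Reject Ho" if stat >= crit else "Fail to reject Ho")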
Chi-square goodness of fit
Example
Of 64 offspring of a certain cross of guinea pigs, 34 are red, 10 are black and 20 are white. According to the genetic model, these numbers should be in the ratio 9:3:4. Are the data consistent with the model? Use α = 0.01.


Chi-square goodness of fit

Solution

Ho: The data agree with the genetic model

H1: The data do not agree with the model

We reject Ho when χ² ≥ χ²(0.01, 2) = 9.210
Chi-square goodness of fit

• Now the chi-square statistic is

χ² = Σ (O - E)²/E = (34 - 36)²/36 + (10 - 12)²/12 + (20 - 16)²/16 ≈ 1.44

• Now since the calculated chi-square is less than the tabulated chi-square, we fail to reject Ho, i.e. the data are consistent with the model.
Point and interval estimation of
population parameter
• A point estimate of a population parameter is a single value used to estimate the parameter.

• Suppose you estimate the statistics class mean height (a parameter) by the sample mean (a statistic) as 63 cm; then 63 cm is the point estimate of the parameter, mean height.
Interval estimation for
population mean
• An interval estimate of a population parameter (e.g. the mean) is a range of values for the parameter, given as an interval.

• E.g. instead of estimating the students' mean height by 63 cm, we may say that the students' mean/average height lies between 62 cm and 65 cm, or within the interval (62, 65).
Interval estimation for
population mean
• The general formula for interval estimation for the population mean μ is

(X̄ - Z(α/2)·S/√n, X̄ + Z(α/2)·S/√n)   or   (X̄ - t(α/2)·S/√n, X̄ + t(α/2)·S/√n)

with probability 1 - α.


Interval estimation for
population mean
• Note: the use of z or t depends on the distribution of the sample mean X̄ that is being used to estimate the population mean μ (the same cases as in the hypothesis tests above).
Interval estimation

• The quantity 1 - α is called the confidence coefficient, i.e. it is the measure of a researcher's confidence that the population mean lies within the interval.

• It is the probability that the interval contains the population mean.
Interval estimation of population
mean
Example
Find the 95% confidence interval for the mean weight of tobacco produced in 2015, using a sample mean weight of 20 tonnes, a sample standard deviation of 2 tonnes and a sample size of 40 bales.
Interval estimation for
population mean
Solution

(X̄ - Z(α/2)·S/√n, X̄ + Z(α/2)·S/√n)
= (20 - 1.96 × 2/√40, 20 + 1.96 × 2/√40)
≈ (19.38, 20.62)
Interval estimation of population
mean
• This means the mean weight of tobacco in 2015 is estimated to be between 19.38 and 20.62 tonnes, with 95% confidence.

• The advantage of interval estimation is that it attaches a margin of error to the estimate.
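A minimal sketch in Python of this confidence interval from summary statistics, assuming scipy is available:

from math import sqrt
from scipy.stats import norm

x_bar, s, n, conf = 20, 2, 40, 0.95
z = norm.ppf(1 - (1 - conf) / 2)           # about 1.96
margin = z * s / sqrt(n)                   # margin of error, about 0.62
print(x_bar - margin, x_bar + margin)      # about (19.38, 20.62)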
Interval estimation for
population mean
• That is, e.g. if we estimate μ by the sample mean X̄ using the interval (X̄ - Z(α/2)·S/√n, X̄ + Z(α/2)·S/√n),

• the margin of error is ± Z(α/2)·S/√n.
Interval estimation of population
mean difference
• Sometimes the interest may be to find a confidence interval for the population mean difference μ1 - μ2.

• Note that the appropriate statistic to use is the difference between the sample means, X̄1 - X̄2.
Interval estimation of population
mean difference
• If we sample from two normal populations with means μ1, μ2 and we know the population standard deviations σ1, σ2, then X̄1 - X̄2 will be normal with mean μ1 - μ2 and variance σ1²/n1 + σ2²/n2, and hence

se(X̄1 - X̄2) = √(σ1²/n1 + σ2²/n2)
Interval estimation of population
mean difference
• If we don't sample from normal populations but the sample sizes are large enough, i.e. n1, n2 ≥ 30, and σ1, σ2 are unknown and estimated by the sample standard deviations S1, S2, then X̄1 - X̄2 would still be approximately normal by the CLT.


Interval estimation of population
mean difference
• Under these two scenarios we use the standard normal (Z) distribution to get the CI for the mean difference, defined as:

( (X̄1 - X̄2) - Z(α/2)·se(X̄1 - X̄2), (X̄1 - X̄2) + Z(α/2)·se(X̄1 - X̄2) )

where se(X̄1 - X̄2) = √(σ1²/n1 + σ2²/n2)   (case 1)

or    se(X̄1 - X̄2) = √(S1²/n1 + S2²/n2)   (case 2)
Interval estimation of population
mean difference
• But if the sample sizes are small, i.e. less than 30, and we estimate σ1, σ2 by S1, S2, then we use the t distribution, and the CI is

( (X̄1 - X̄2) - t(α/2, df)·se(X̄1 - X̄2), (X̄1 - X̄2) + t(α/2, df)·se(X̄1 - X̄2) )

where se(X̄1 - X̄2) = S√(1/n1 + 1/n2) and S = √[ ((n1 - 1)S1² + (n2 - 1)S2²) / (n1 + n2 - 2) ]
Interval estimation of population
mean difference
with df = n1 + n2 - 2.

Note: in using such a t distribution we assume that the two population variances are equal, i.e. σ1² = σ2² = σ².
Interval estimation of population
mean difference
• If we assume that they are not equal, then we would have the same form of CI for the difference of means using the t distribution, but with

se(X̄1 - X̄2) = √(S1²/n1 + S2²/n2) and df = the smaller of n1 - 1 or n2 - 1.
Interval estimation of population
mean difference
Example

For a random sample of 190 Agribusiness

firms that revalued their fixed assets, the

mean ratio of debt to tangible assets was

0.517 and the sample standard deviation

was 0.148.
Interval estimation of population
mean difference
 For an independent random sample of 417
firms that did not revalue their fixed
assets, the mean ratio of debt to tangible
assets was 0.489 and the sample
standard deviation was 0.159. Find a 99%
confidence interval for the difference
between the two population means.
Interval estimation of population
mean difference
Solution

X̄ - Ȳ ± z(α/2)·√(sX²/nX + sY²/nY) = 0.517 - 0.489 ± 2.575 × √(0.148²/190 + 0.159²/417)
= 0.028 ± 2.575 × √(0.0001153 + 0.00006063)
= 0.028 ± 0.034

Or (-0.0062, 0.0622)
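A minimal sketch in Python of this 99% confidence interval, assuming scipy is available:

from math import sqrt
from scipy.stats import norm

x1, s1, n1 = 0.517, 0.148, 190   # firms that revalued fixed assets
x2, s2, n2 = 0.489, 0.159, 417   # firms that did not
conf = 0.99
z = norm.ppf(1 - (1 - conf) / 2)           # about 2.576
se = sqrt(s1**2 / n1 + s2**2 / n2)
diff = x1 - x2
print(diff - z * se, diff + z * se)        # about (-0.006, 0.062)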
Interval estimation of population
mean difference
Example

A farmer wants to estimate the difference in mean maize yield between UREA and CAN. He/she gets 10 yields in kg for each fertilizer and computes the sample mean yield for each fertilizer.


Interval estimation of population
mean difference
The following is the sample data:

        UREA          CAN
mean    X̄1 = 83256    X̄2 = 88354
sd      s1 = 3256     s2 = 2341
n       n1 = 10       n2 = 10

Construct a 95% CI for the mean difference.


Interval estimation of population
mean difference
Solution

With unequal variances assumed, df = the smaller of n1 - 1 and n2 - 1 = 9, and t(0.025, 9) = 2.262:

X̄1 - X̄2 ± t(α/2, df)·√(S1²/n1 + S2²/n2)
= (83256 - 88354) ± 2.262 × √(3256²/10 + 2341²/10)
= -5098 ± 2868.5
≈ (-7966.5, -2229.5)
Interval estimation for
population proportion
• Just as we had interval estimation for the population mean, we can have interval estimation for the population proportion.

• The interval estimate for the population proportion P using the sample proportion P̂ is defined as

( P̂ - Z(α/2)·√(P̂(1 - P̂)/n), P̂ + Z(α/2)·√(P̂(1 - P̂)/n) )
Interval estimation for
population proportion
• with confidence coefficient 1 - α, i.e. the probability that the population proportion lies within the interval.

• Note: for the confidence interval for a proportion we will only use the z distribution for the sample proportion, because we will mostly be dealing with large samples (n ≥ 30).
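A minimal sketch in Python of this proportion interval, assuming scipy is available; the values P̂ = 0.68 and n = 805 are those used in the example on the next slides:

from math import sqrt
from scipy.stats import norm

p_hat, n, conf = 0.68, 805, 0.95
z = norm.ppf(1 - (1 - conf) / 2)              # about 1.96
margin = z * sqrt(p_hat * (1 - p_hat) / n)    # about 0.032
print(p_hat - margin, p_hat + margin)         # about (0.648, 0.712)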
Interval estimation for
population proportion
Example
FUM conducted a poll between Jan. 14 and Jan. 22, 2016. They asked 805 people whether to retain FISP or drop it. They found a sample proportion of 68% who supported retaining it.
Using Interval Estimation to test
hypothesis
Construct a 95% CI for the true proportion of

FISP supporters.

Solution

P̂ ± Z(α/2)·√(P̂(1 - P̂)/n) = 0.68 ± 1.96 × √(0.68(1 - 0.68)/805)
≈ (0.648, 0.712)
Confidence interval for
proportion difference
• Here we wish to construct an interval estimate for P1 - P2.

• The appropriate sample statistic to use is the sample proportion difference, denoted by P̂1 - P̂2.
Confidence interval for
proportion difference
• Note: if the sample sizes are large, i.e. n1, n2 ≥ 30, then by the CLT P̂1 - P̂2 has an approximately normal distribution with mean P1 - P2, variance P1(1 - P1)/n1 + P2(1 - P2)/n2, and standard error √(P1(1 - P1)/n1 + P2(1 - P2)/n2).
Confidence interval for
proportion difference
• That is, we use the standard normal distribution to get the CI for P1 - P2 when we have large samples; in this case the CI is defined as:

P̂1 - P̂2 ± Z(α/2)·se(P̂1 - P̂2)
= P̂1 - P̂2 ± Z(α/2)·√(P1(1 - P1)/n1 + P2(1 - P2)/n2)
= ( P̂1 - P̂2 - Z(α/2)·√(P1(1 - P1)/n1 + P2(1 - P2)/n2), P̂1 - P̂2 + Z(α/2)·√(P1(1 - P1)/n1 + P2(1 - P2)/n2) )
Confidence interval for
proportion difference
• where P1, P2 are approximated by the sample values P̂1, P̂2.

• In the case of assuming equal population proportions, i.e. P1 = P2 = P, the CI for P1 - P2 is

( P̂1 - P̂2 - Z(α/2)·√(P(1 - P)(1/n1 + 1/n2)), P̂1 - P̂2 + Z(α/2)·√(P(1 - P)(1/n1 + 1/n2)) )

where P = (P̂1n1 + P̂2n2)/(n1 + n2) is the pooled sample proportion.
Confidence interval for
proportions difference
Example: An NFS student believes that a sweetener called xylitol helps prevent ear infections. In a randomized experiment, 165 children took a placebo and 68 of them got ear infections.
Confidence interval for
proportions difference
Another sample of 159 children took xylitol and 46 of them got ear infections. Construct the 95% CI for the difference in proportions, assuming that the proportions are not equal.


Confidence interval for
proportions difference
Solution

Note that P̂1 = 68/165 = 0.412, P̂2 = 46/159 = 0.289 and Z(α/2) = Z(0.025) = 1.96

CI = ( 0.412 - 0.289 - 1.96 × √(0.412(1 - 0.412)/165 + 0.289(1 - 0.289)/159),
       0.412 - 0.289 + 1.96 × √(0.412(1 - 0.412)/165 + 0.289(1 - 0.289)/159) )
   ≈ (0.020, 0.226)
Testing two tailed hypothesis
formulation using CI
 An interval estimate of a parameter may
be used to test the two tailed hypothesis
formulation about the parameter.

 In this case you fail to reject Ho if CI


contains the parameter value under Ho
and reject Ho if CI excludes this value.
Testing hypothesis using CI
Example
Find the 95% confidence interval for the mean weight of tobacco produced in 2015 using a sample mean weight of 20 tonnes, a sample standard deviation of 2 tonnes and a sample size of 40 bales, and use your CI to test the claim that the mean weight was 17 tonnes.


Two tailed hypothesis test using
Confidence Interval
Solution

Hypothesis formulation:
Ho: μ = 17 versus
H1: μ ≠ 17
Two tailed hypothesis test using
CI
The 95% confidence interval for the mean weight is

(X̄ - Z(α/2)·S/√n, X̄ + Z(α/2)·S/√n)
= (20 - 1.96 × 2/√40, 20 + 1.96 × 2/√40)
≈ (19.38, 20.62)
Two tailed hypothesis test using
CI
Now since the interval does not contain the value under Ho (17), we reject Ho, adopt the alternative hypothesis, and say that the mean weight was different from 17 tonnes.


Two tailed hypothesis test using
CI
• Note that we use a CI to test a two tailed hypothesis formulation only, and not a one tailed formulation, since the confidence interval is two sided.

The End
