PGD Sta - Sta 703
or $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}$
when the population size N is finite and n ≥ 0.05N
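$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}} = \dfrac{12}{\sqrt{81}}\sqrt{\dfrac{900-81}{900-1}} = \dfrac{12}{9}\sqrt{\dfrac{819}{899}} = 1.3333\sqrt{0.9110} \approx 1.27$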
Example 3: Find the probability that the mean $\bar{X}$ of a random sample of size 36 taken from the population in Example 1 lies between 18 and 24.
$Z_1 = \dfrac{18-20}{\sigma/\sqrt{n}} = \dfrac{18-20}{2} = -1$
$Z_2 = \dfrac{24-20}{2} = 2$
$P(18 < \bar{X} < 24) = P(-1 < Z < 2) = P(Z < 2) - P(Z < -1) = 0.9772 - 0.1587 = 0.8185$
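The probability in Example 3 can be checked numerically. The sketch below is only an illustration: the helper name `phi` is an assumption, and the population mean of 20 and standard error of 2 are taken from the working above. It evaluates the standard normal CDF with the error function instead of reading a table, so the last decimal place may differ slightly from the table value.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu = 20.0       # population mean used in the example
std_err = 2.0   # sigma / sqrt(n), as used in the example

z1 = (18 - mu) / std_err   # lower limit in standard units
z2 = (24 - mu) / std_err   # upper limit in standard units
prob = phi(z2) - phi(z1)   # P(z1 < Z < z2)

print(f"z1 = {z1:.1f}, z2 = {z2:.1f}, P = {prob:.4f}")
```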
Estimation Using Normal Distribution
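Point Estimate is a single value used to estimate an unknown population parameter.
The point estimate is unbiased if the mean of repeated samples from a population is the mean of the population, i.e. $E(\bar{X}) = \mu$
Interval Estimate is a range of values with the probability (or confidence level) that the interval includes the unknown population parameter. If we are given a confidence level of 1 − α, we can then obtain the confidence interval as:
$P(\bar{X} - Z_{1-\alpha}\,\sigma_{\bar{x}} < \mu < \bar{X} + Z_{1-\alpha}\,\sigma_{\bar{x}}) = 1-\alpha$
Since $Z = \dfrac{\bar{X}-\mu}{\sigma_{\bar{x}}}$ lies between $-Z_{1-\alpha}$ and $+Z_{1-\alpha}$ with probability 1 − α,
then $\mu = \bar{X} \pm Z_{1-\alpha}\,\sigma_{\bar{x}}$
Here $Z_{1-\alpha}$ denotes the two-sided critical value for the confidence level 1 − α (for example, 1.96 for 95%).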
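For proportion: $Z = \dfrac{\bar{p}-p}{\sqrt{\dfrac{p(1-p)}{n}}}$, assuming n < 0.05N, where $\bar{p}$ is the sample proportion and p the population proportion.
This is a Binomial distribution; however, when n is large, with both np > 5 and n(1 − p) > 5, the distribution tends to normal.
The confidence interval for proportion is given by:
$p = \bar{p} \pm Z_{1-\alpha}\sqrt{\dfrac{\bar{p}(1-\bar{p})}{n}}$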
Confidence intervals for the mean using t- Distribution
When a population is normally distributed but its standard deviation σ is unknown and n < 30, the confidence interval cannot be determined using the normal distribution; the t-distribution is then used.
Confidence interval of the mean at α is $\bar{X} \pm t_{1-\alpha}\,\dfrac{s}{\sqrt{n}}$
$\sigma_{\bar{x}}$ is estimated using $\dfrac{s}{\sqrt{n}}$, where s = sample standard deviation.
Chebyshev’s Theorem
When the sample size n < 30 and the population is not normally distributed, we cannot use the Normal distribution or t distribution to determine the confidence interval. Chebyshev's theorem is then applied.
Given that the level of significance = α, the confidence level = 1 − α.
According to Chebyshev's theorem, $1-\alpha = 1 - \dfrac{1}{k^2}$. This implies that $\dfrac{1}{k^2} = \alpha$
$\therefore k^2 = \dfrac{1}{\alpha}$ and $k = \sqrt{\dfrac{1}{\alpha}}$
Therefore $\mu = \bar{x} \pm k\,\dfrac{\sigma}{\sqrt{n}}$
Note: Chebyshev's theorem is not often used because the resulting confidence interval is usually wide.
Example I: A random sample of size n = 25 has a mean $\bar{X}$ = 80. The sample was taken from a population of size 1000 and the population standard deviation is known to be σ = 30. Given that the population is not normally distributed, find the 95% confidence interval for the unknown population mean.
1 − α = 0.95
$1 - \dfrac{1}{k^2} = 0.95$, so $\dfrac{1}{k^2} = 0.05$ and $k^2 = \dfrac{1}{0.05} = 20$
$k = \sqrt{20} = 4.47$
The 95% confidence interval for μ is $\bar{X} \pm k\,\dfrac{\sigma}{\sqrt{n}}$
$\mu = 80 \pm \dfrac{4.47 \times 30}{\sqrt{25}}$
= 80 ± 26.82
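The Chebyshev interval in Example I can be reproduced with a few lines of code. This is a minimal sketch under the assumptions of the example (known σ, no finite population correction because n < 0.05N); the function name `chebyshev_interval` is hypothetical. Because k is not rounded to 4.47 here, the margin comes out marginally larger than the hand calculation.

```python
import math

def chebyshev_interval(mean, sigma, n, confidence):
    """Confidence limits for the mean from Chebyshev's theorem: mean +/- k*sigma/sqrt(n)."""
    alpha = 1.0 - confidence
    k = math.sqrt(1.0 / alpha)           # from 1 - 1/k^2 = 1 - alpha
    margin = k * sigma / math.sqrt(n)
    return mean - margin, mean + margin, k

lower, upper, k = chebyshev_interval(mean=80, sigma=30, n=25, confidence=0.95)
print(f"k = {k:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```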
Example II: A pharmacy wanted to estimate, at the 90% confidence level and to within ±0.06, the actual proportion of buyers who prefer a particular brand of pain relief drug to some other brand. Past records show that 30% of buyers choose the particular brand. What is the minimum sample size to be taken for the estimate?
$Z = \dfrac{\bar{p}-p}{\sigma_p} = \dfrac{0.06}{\sigma_p}$
$Z\,\sigma_p = 0.06$, so $\sigma_p = \dfrac{0.06}{Z}$
$Z_{0.90} = 1.64$
$\sigma_p = \dfrac{0.06}{1.64} = 0.0366$
But $\sigma_p = \sqrt{\dfrac{p(1-p)}{n}}$
$\sigma_p^2 = \dfrac{p(1-p)}{n} \;\therefore\; n = \dfrac{p(1-p)}{\sigma_p^2}$
$n = \dfrac{0.3 \times 0.7}{(0.0366)^2} \approx 157$
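The sample size calculation in Example II can be sketched as follows. The function name `min_sample_size` is an assumption made for illustration; the critical value 1.64 is the one used in the example, and the result is rounded up because a sample size must be a whole number.

```python
import math

def min_sample_size(p, margin, z):
    """Smallest n for which z * sqrt(p(1-p)/n) does not exceed the margin."""
    sigma_p = margin / z                 # required standard error of the proportion
    return math.ceil(p * (1 - p) / sigma_p ** 2)

n_required = min_sample_size(p=0.30, margin=0.06, z=1.64)
print(n_required)
```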
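Example III: Obtain the 95% confidence interval for the mean of a population of size 800 if a random sample of size n = 64, taken from the population, has a mean $\bar{X}$ = 50 and standard deviation S = 20.
Solution
Since n > 30, we can assume a normal distribution.
n = 64, N = 800
0.05N = 40, so n = 64 > 0.05N and the finite population correction applies.
$\sigma_{\bar{x}} = \dfrac{S}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}} = \dfrac{20}{\sqrt{64}}\sqrt{\dfrac{800-64}{800-1}} = 2.4$
The 95% confidence interval is therefore $\mu = \bar{X} \pm 1.96\,\sigma_{\bar{x}} = 50 \pm 1.96(2.4) = 50 \pm 4.70$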
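The standard error with the finite population correction in Example III, and the 95% limits that follow from it, can be checked with the sketch below. The function name `fpc_standard_error` is hypothetical, and 1.96 is the usual two-sided normal critical value for 95% confidence.

```python
import math

def fpc_standard_error(s, n, N):
    """Standard error of the mean with the finite population correction."""
    return (s / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))

n, N = 64, 800
mean, s = 50.0, 20.0
z_95 = 1.96  # two-sided critical value for 95% confidence

use_fpc = n >= 0.05 * N  # the correction is applied only when n >= 0.05N
std_err = fpc_standard_error(s, n, N) if use_fpc else s / math.sqrt(n)
lower, upper = mean - z_95 * std_err, mean + z_95 * std_err

print(f"standard error = {std_err:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```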
HYPOTHESES TESTING
Hypothesis testing is one of the fundamental aspects of statistical inference and analysis.
Hypotheses testing is concerned with testing an assumption about a population parameter
based on the result obtained from a sample drawn from the population .
Definitions:
i) Type I error: rejecting a hypothesis that is true.
ii) Type II error: accepting a hypothesis that is false.
iii) One tail test: only one extreme end of the distribution is considered.
iv) Two tail test: the two extreme ends of the distribution are considered.
v) Null hypothesis: the null hypothesis is based on the premise of no difference, no association or no effect, i.e. it can be likened to the principle that an accused is not guilty until proven otherwise.
vi) Alternative hypothesis: usually reflects the investigator's belief or what is expected from the research study.
TEST OF HYPOTHESES
1. State the null hypothesis Ho and the alternative hypothesis H1
2. Determine the level of significance for the test.
3. Determine the test statistic.
4. State the decision rule.
5. Calculate the test statistic based on values got from the sample.
6. Make a decision.
7. Conclude.
Test of Hypotheses About Population Means And Proportion
Example l: A medical researcher claims that the average lifetime expectancy of women
in Nigeria is 72 years. Test this claim if a sample of 100 women taken from obituaries of
women who had died show that the average age of the dead women was 69 years with a
standard deviation of 15 years. Use a level of significance α = 0.05.
Hypotheses: H0: µ = 72, H1: µ ≠ 72
Level of significance: α = 0.05
Test Statistic: $Z = \dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}$ or $Z = \dfrac{\bar{X}-\mu}{s/\sqrt{n}}$ if n > 30
Decision Rule: Accept H0 if $-Z_{1-\alpha} < Z < Z_{1-\alpha}$
Calculation: Since n > 30, we shall use s as an estimate of σ
µ = 72, $\bar{X}$ = 69, n = 100
$Z = \dfrac{\bar{X}-\mu}{\sigma_{\bar{x}}} = \dfrac{\bar{X}-\mu}{s/\sqrt{n}} = \dfrac{69-72}{15/\sqrt{100}} = -\dfrac{3}{15/10} = -\dfrac{3}{1.5} = -2.0$
From the table: $Z_{1-\alpha} = 1.96$, so the acceptance region is −1.96 < Z < 1.96 and the rejection regions are Z < −1.96 and Z > 1.96.
Since the calculated Z-value (−2.0) falls in the rejection region, we shall not accept H0.
Conclusion: Based on the available data, the conclusion is that the life expectancy of women in Nigeria is not 72 years.
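The seven steps can be traced in code for Example I. The following is a minimal sketch of a two-tailed large-sample Z test; the function name `one_sample_z` is an assumption, and the critical value 1.96 is taken from the normal table as in the example.

```python
import math

def one_sample_z(sample_mean, mu0, s, n):
    """Large-sample Z statistic for a test about a population mean."""
    return (sample_mean - mu0) / (s / math.sqrt(n))

z = one_sample_z(sample_mean=69, mu0=72, s=15, n=100)
z_critical = 1.96                 # two-tailed critical value for alpha = 0.05
reject_h0 = abs(z) > z_critical   # decision rule: reject if Z is outside (-1.96, 1.96)

print(f"Z = {z:.2f}, reject H0: {reject_h0}")
```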
Example ll A poultry farmer claims that over the years 80% of his broiler chicks survive
up to 10 weeks. In a batch of 36 day old broilers that he reared, only 25 of them survived
up to 10 weeks. Based on this observation, test whether that batch of broilers performed
worse than the previous batches. Use a level of significance α = 0.05
Solution
Hypotheses: H0: p ≥ 0.80, H1: p < 0.80
Level of significance: α = 0.05
Test Statistic: $Z = \dfrac{\bar{p}-p}{\sigma_p} = \dfrac{\bar{p}-p}{\sqrt{\dfrac{p(1-p)}{n}}}$
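Decision Rule: Reject H0 if calculated Z < $-Z_{1-\alpha}$ (= −1.645)
Calculation: $\bar{p} = \dfrac{25}{36} = 0.6944$
$Z = \dfrac{\bar{p}-p}{\sqrt{\dfrac{p(1-p)}{n}}} = \dfrac{0.6944-0.80}{\sqrt{\dfrac{0.80 \times 0.20}{36}}} = -\dfrac{0.1056}{\sqrt{0.16/36}} = -\dfrac{0.1056}{0.0667} = -1.58$
One Tail Test: the critical value lies in the left tail, ∴ $-Z_{1-\alpha}$ = −1.645
Decision: Since the calculated Z (= −1.58) > −1.645, we shall not reject H0. The result is borderline: rounding $\bar{p}$ to 0.69 before dividing would give Z = −1.65, just inside the rejection region.
Conclusion: Based on the analysis of available data, there is not sufficient evidence at the 5% level to conclude that that particular batch of broilers performed worse than the previous batches.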
Example lll: (small sample)
A hospital management wants to know with a 90% level of confidence if the 50cl bottles of syrup actually contain 50cl of syrup. From past records it is known that the volume of the syrup in the bottles is normally distributed. The hospital store officer took a random sample of 16 bottles and found that $\bar{X}$ = 52cl with standard deviation s = 7.5cl.
Solution
Hypotheses: H0: µ = 50, H1: µ > 50. This is a one tail test
Test Statistic: $t = \dfrac{\bar{X}-\mu}{s/\sqrt{n}}$
Decision Rule: Accept H0 if the calculated value of t falls within the acceptance region, i.e. if t value < $t_{\alpha,15}$
Calculation: $t = \dfrac{\bar{X}-\mu}{s/\sqrt{n}} = \dfrac{52-50}{7.5/\sqrt{16}} = \dfrac{2}{7.5/4} = \dfrac{2}{1.875} = 1.0667$
From the t-table, $t_{15,0.05} = 1.753$
DECISION: Since the calculated t-value (= 1.0667) < 1.753 (i.e. it falls within the acceptance region), we shall not reject the null hypothesis H0.
CONCLUSION: The Hospital management can accept with 90% confidence that the
volume of syrup in the bottles is 50cl.
Example IV (proportion and small sample). Suppose that in Example II the number of broiler chicks in the batch was not 36 but 16, out of which only 12 survived to 10 weeks. The solution will be as given below:
Solution:
Hypotheses: H0: p = 0.80, H1: p < 0.80
Level of significance: α = 0.05
Decision Rule: Accept H0 if calculated t > table t (= −1.753)
Calculation
$\bar{p} = \dfrac{12}{16} = \dfrac{3}{4} = 0.75$
$\sigma_p = \sqrt{\dfrac{0.8 \times 0.2}{16}} = \sqrt{\dfrac{0.16}{16}} = \sqrt{0.01} = 0.1$
$t = \dfrac{\bar{p}-p}{\sigma_p} = \dfrac{0.75-0.8}{0.1} = -\dfrac{0.05}{0.1} = -0.5$
$t_{15,0.05} = -1.753$
Decision:
Since the t value (= −0.5) > $t_{0.05,15}$ (= −1.753), we shall not reject H0 because the t-value falls within the acceptance region.
Conclusion: Based on the analysis of the available data, there is no sufficient basis to reject the poultry farmer's claim.
Test of Hypotheses for differences between two means or proportions
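$\sigma_{\bar{x}_1-\bar{x}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}}$ or $\sqrt{\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}}$
$\sigma_{p_1-p_2} = \sqrt{\dfrac{p(1-p)}{n_1}+\dfrac{p(1-p)}{n_2}}$
$p = \dfrac{n_1 p_1 + n_2 p_2}{n_1+n_2}$, i.e. p = weighted average of p1 and p2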
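Test Statistic
For large n, $Z = \dfrac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}}}$ or $Z = \dfrac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}}}$
or $Z = \dfrac{p_1-p_2}{\sqrt{\dfrac{p(1-p)}{n_1}+\dfrac{p(1-p)}{n_2}}}$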
Example 1
Researchers conducted a study to determine whether magnets are effective in treating
back pain. Pain was measured using the visual analog scale and the coded results, given
below, were obtained from a pilot study.
Reduction in pain level after magnet treatment: n = 54, $\bar{X}$ = 54, S = 18
Reduction in pain after placebo treatment: n = 40, $\bar{X}$ = 60, S = 20
Based on the result, test at 5% level of significance if the magnet treatment is effective.
Solution
Hypotheses: H0: µ1 = µ2, H1: µ1 < µ2
Level of significance: α = 0.05
Test statistic: $Z = \dfrac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}}}$
Decision Rule: Reject H0 if the Z value < $-Z_{1-\alpha}$
Calculation
n1 = 54, $\bar{x}_1$ = 0.54, s1 = 0.18
n2 = 40, $\bar{x}_2$ = 0.60, s2 = 0.20
$Z \cong \dfrac{0.54-0.60}{\sqrt{\dfrac{(0.18)^2}{54}+\dfrac{(0.20)^2}{40}}} = \dfrac{-0.06}{\sqrt{0.0006+0.001}} = \dfrac{-0.06}{\sqrt{0.0016}} = -\dfrac{0.06}{0.04} = -1.5$
$-Z_{1-\alpha} = -Z_{0.95} = -1.645$
Decision: Since the Z value (= −1.5) > −1.645, the null hypothesis shall be accepted.
Conclusion: Based on the results from the pilot study, it cannot be concluded that the magnet treatment is more effective than the placebo treatment.
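The two-sample Z statistic used in this example can be computed directly from the coded summary values. The sketch below is illustrative only; `two_sample_z` is a hypothetical helper name, and −1.645 is the one-tailed 5% critical value quoted in the example.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic for the difference between two large-sample means."""
    std_err = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / std_err

z = two_sample_z(0.54, 0.18, 54, 0.60, 0.20, 40)
critical = -1.645          # left-tail critical value for alpha = 0.05
reject_h0 = z < critical   # decision rule: reject H0 if Z < -1.645

print(f"Z = {z:.2f}, reject H0: {reject_h0}")
```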
Example 2:
A hospital wanted to know if two brands of surgical gloves respond in the same way to stress. Among 100 gloves taken from Brand 1, 9% leaked; of the 80 gloves taken from Brand 2, 7% leaked under stress.
Test the claim that there is no difference in the leakage rate for the two brands. Use a level of significance α = 0.10
Hypotheses: H0: p1 = p2, H1: p1 ≠ p2
Level of significance: α = 0.10
Test statistic: $Z = \dfrac{p_1-p_2}{\sqrt{\dfrac{p(1-p)}{n_1}+\dfrac{p(1-p)}{n_2}}}$ where $p = \dfrac{n_1 p_1 + n_2 p_2}{n_1+n_2}$
Decision Rule: Accept H0 if $-Z_{\alpha/2} < Z < +Z_{\alpha/2}$, i.e. $-Z_{0.05} < Z < Z_{0.05}$
Calculation: $p = \dfrac{100(0.09)+80(0.07)}{100+80} = \dfrac{14.6}{180} \approx 0.08$
$Z = \dfrac{0.09-0.07}{\sqrt{\dfrac{(0.08)(0.92)}{100}+\dfrac{(0.08)(0.92)}{80}}} = \dfrac{0.02}{\sqrt{0.000736+0.00092}} = \dfrac{0.02}{\sqrt{0.001656}} = \dfrac{0.02}{0.041} = 0.4878$
$Z_{\alpha/2} = 1.645$
Decision: Since the Z value (= 0.4878) < $Z_{\alpha/2}$ (= 1.645), the null hypothesis H0 will not be rejected.
Conclusion: Based on the analysis of the available data, it will be concluded that there is no significant difference in the leakage rates of the two brands of gloves.
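The pooled-proportion test in this example can be sketched as below. The function name `two_proportion_z` is an assumption. The code keeps the pooled proportion unrounded (about 0.0811 rather than 0.08), so the statistic differs from the hand calculation in the third decimal place; 1.645 is the standard two-tailed critical value for α = 0.10.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Z statistic for the difference between two proportions, using the pooled proportion."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)   # weighted average of p1 and p2
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_err, pooled

z, pooled = two_proportion_z(0.09, 100, 0.07, 80)
z_critical = 1.645                # two-tailed critical value for alpha = 0.10
reject_h0 = abs(z) > z_critical

print(f"pooled p = {pooled:.4f}, Z = {z:.3f}, reject H0: {reject_h0}")
```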
ANALYSIS OF CATEGORICAL DATA
Categorical (or qualitative or attribute) data are data that can be separated into different categories (called cells) that are distinguished by non-numerical characteristics. Experiments that result in categorical data are multinomial. A multinomial experiment can be likened to a binomial experiment except that, whereas a binomial experiment has two categories, a multinomial experiment has more than two categories.
Conditions for multinomial experiments
i. The number of trials is fixed
ii. The trials are independent
iii. All outcomes of each trial must be classified into exactly one of several
different categories
iv. The probabilities for the different categories remain constant for each trial
χ² distribution
If independent samples of size n are selected at random from a normally distributed population with variance σ², the sample variance s² obtained from the samples has a statistic $\chi^2 = \dfrac{(n-1)s^2}{\sigma^2}$. This statistic is said to be Chi-square distributed.
The Chi-square statistic is used in the analysis of categorical data
Properties of Chi-square Distribution
- Unlike the normal and t distributions, the Chi-square distribution is not symmetric
- Values of Chi-square statistic can be zero or positive but not negative
- The Chi-square distribution is dependent on number of degrees of freedom
- As number of degrees of freedom increases, the chi-square distribution
approaches the normal distribution
Uses of the Chi-square distribution
(i) To test goodness of fit – to test the hypothesis that an observed frequency
distribution fits some claimed distribution.
(ii) To test for independence in a contingency table – to test that there is no association between row variables and column variables in a contingency table.
(iii) To test for homogeneity between samples –to test the claim that different
populations have the same proportions of some characteristics.
The probability distribution function of a random variable Y which is χ² distributed is given as
$f(y) = \dfrac{1}{\left(\frac{n}{2}-1\right)!}\;2^{-n/2}\,y^{(n/2)-1}\,e^{-y/2}$
Calculation
$E = \dfrac{n}{k} = \dfrac{175}{7} = 25$
Expected for each day = 25
Day      No of fatalities (O)   Expected no of fatalities (E)   O − E   (O − E)²   (O − E)²/E
Sunday   30                     25                              5       25         1.00
Mon      20                     25                              -5      25         1.00
Tue      18                     25                              -7      49         1.96
Wed      21                     25                              -4      16         0.64
Thu      22                     25                              -3      9          0.36
Fri      29                     25                              4       16         0.64
Sat      35                     25                              10      100        4.00
Total    175                    175                                                9.60
$\chi^2_{6,0.05} = 12.592$; χ² value = 9.60
Decision: Since the χ² value (= 9.60) < $\chi^2_{6,0.05}$ (= 12.592), the null hypothesis will not be rejected.
Conclusion: Based on the analysis of the sample data, there is no sufficient basis to conclude that the number of fatalities on the days of the week differs significantly.
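The goodness-of-fit statistic in the table above can be recomputed in a few lines. This is a minimal sketch assuming equal expected frequencies for all seven days, as in the calculation; the function name `chi_square_gof` is hypothetical, and the critical value 12.592 is the table value quoted above.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [30, 20, 18, 21, 22, 29, 35]            # Sunday to Saturday
expected_each = sum(observed) / len(observed)      # n / k, equal share per day
expected = [expected_each] * len(observed)

chi2 = chi_square_gof(observed, expected)
critical = 12.592                                  # chi-square table value, 6 d.f., alpha = 0.05
reject_h0 = chi2 > critical

print(f"E = {expected_each}, chi-square = {chi2:.2f}, reject H0: {reject_h0}")
```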
Example II
In Genetics, it is claimed that if a couple are both AS genotype, then their offspring could
be AA, AS or SS with probabilities 0.25, 0.50 and 0.25 respectively. The data below lists
the genotype frequencies of 180 randomly selected offspring from couples that are both
AS genotype.
Genotype      AA   AS    SS
Frequencies   44   100   36

Genotype   Observed (O)   Expected (E)       (O − E)²/E
AA         44             0.25 x 180 = 45    0.0222
AS         100            0.50 x 180 = 90    1.1111
SS         36             0.25 x 180 = 45    1.8000
Total      180            180                2.9333
χ² value = 2.9333; $\chi^2_{2,0.01} = 9.21$
Decision: Since the χ² value (= 2.9333) < $\chi^2_{2,0.01}$ (= 9.21), the null hypothesis H0 will not be rejected.
Conclusion: Based on the analysis of the data, it will be concluded that the data are consistent with the genetics claim.
2) Test of independence: Use of the Contingency Table
In the test of independence the concern is to find the dependency relationship between
two or more attributes. The population and sample are classified into several attributes.
The test of interest is to know whether there is any dependency relationship between the
attributes. The test does not indicate the level of discrepancy for the contingency table.
Expected value in each cell = $\dfrac{\text{Column total} \times \text{Row total}}{\text{Overall total}}$
Example 1 A study was conducted to test whether smoking has any relationship with
gender.
A random sample of 100 persons was selected and the distribution is as shown:
              Male        Female      Total
Smoker        30          10          n1. = 40
Non Smoker    20          40          n2. = 60
Total         n.1 = 50    n.2 = 50    100
Expected counts:
$n_{11} = \dfrac{n_{1.} \times n_{.1}}{n_{..}} = \dfrac{40 \times 50}{100} = 20$
$n_{12} = \dfrac{n_{1.} \times n_{.2}}{n_{..}} = \dfrac{40 \times 50}{100} = 20$
Similarly $n_{21}$ = 30, $n_{22}$ = 30
$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\dfrac{(o_{ij}-e_{ij})^2}{e_{ij}} = \dfrac{(30-20)^2}{20} + \dfrac{(10-20)^2}{20} + \dfrac{(20-30)^2}{30} + \dfrac{(40-30)^2}{30} = 16.67$
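The expected counts and the χ² statistic for a contingency table follow mechanically from the row and column totals. The sketch below illustrates this for the smoking and gender table; the function names `expected_counts` and `chi_square` are assumptions made for this illustration.

```python
def expected_counts(table):
    """Expected count for each cell: row total * column total / overall total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(table):
    """Chi-square statistic: sum of (o - e)^2 / e over every cell of the table."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

observed = [[30, 10],   # smokers: male, female
            [20, 40]]   # non-smokers: male, female
expected = expected_counts(observed)
chi2 = chi_square(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)   # (rows - 1)(columns - 1)

print(expected, round(chi2, 2), dof)
```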
                   PET/CT
                   Correct   Incorrect
MRI   Correct      36        1
      Incorrect    11        2
Test the claim that there is no difference in the accuracy of the tests. Use a level of
significance α = 0.05
Solution
Calculation: This example is a test of homogeneity. Expected counts are shown in brackets.
                   PET/CT
                   Correct       Incorrect   Total
MRI   Correct      36 (34.78)    1 (2.22)    37
      Incorrect    11 (12.22)    2 (0.78)    13
      Total        47            3           50
$E = \dfrac{n_{i.}\,n_{.j}}{n_{..}}$, e.g. $n_{11} = \dfrac{n_{1.} \times n_{.1}}{\text{total}} = \dfrac{37 \times 47}{50} = 34.78$
$n_{21} = \dfrac{n_{2.} \times n_{.1}}{\text{total}} = \dfrac{13 \times 47}{50} = 12.22$
Similarly $n_{12} = \dfrac{37 \times 3}{50} = 2.22$ and $n_{22} = \dfrac{13 \times 3}{50} = 0.78$
$\chi^2 = \dfrac{(36-34.78)^2}{34.78} + \dfrac{(11-12.22)^2}{12.22} + \dfrac{(1-2.22)^2}{2.22} + \dfrac{(2-0.78)^2}{0.78}$
= 0.0428 + 0.1218 + 0.6705 + 1.9082 = 2.7433
Degree of freedom = (2-1)(2-1) = 1 x 1 = 1; $\chi^2_{1,0.05} = 3.841$
Decision
Since the χ² value (= 2.7433) < $\chi^2_{1,0.05}$ (= 3.841), the null hypothesis will not be rejected.
Conclusion: Based on the analysis of the available data, it can be concluded that there is no significant difference between diagnosis made using the two tests.
Note: Before computing χ², Yates' correction can be made for cells with expected count (E) < 2
Example II
Two different schools were to be compared in terms of their performance. A sample of 100 students was taken from each school and a test was administered to them.
The grades obtained are as given below:
Grade   School 1   School 2
A       10         10
B       20         10
C       30         40
D       20         30
F       20         10
Total   100        100
Is there any difference between the grades of the two schools?
Observed counts are shown with expected counts in brackets:
Grade   School I   School II   (o − e)²/e for I   (o − e)²/e for II
A       10(10)     10(10)      0                  0
B       20(15)     10(15)      1.7                1.7
C       30(35)     40(35)      0.7                0.7
D       20(25)     30(25)      1.0                1.0
F       20(15)     10(15)      1.7                1.7
Total                          5.1                5.1
Example
Suppose two different processes are used to manufacture bulbs. Suppose the life of the light bulb for process A is normally distributed with mean $\mu_a$ and standard deviation $\sigma_a$ and that for process B is also normally distributed with mean $\mu_b$ and standard deviation $\sigma_b$. The two processes can be compared by taking a sample from each process.
Let the observations be:
Sample A              Sample B
n_a = 17              n_b = 21
X̄_a = 120hr          X̄_b = 1300hr
S_a = 60hr            S_b = 50hr
$F = \dfrac{\hat{\sigma}_a^2}{\hat{\sigma}_b^2} = \dfrac{n_a S_a^2}{n_a-1} \Big/ \dfrac{n_b S_b^2}{n_b-1} = \dfrac{17(60^2)}{16} \Big/ \dfrac{21(50^2)}{20} = \dfrac{51}{35} = 1.46$
$F_{16,20,0.05}$ = 2.18
F < $F_{16,20,0.05}$ ∴ H0: $\sigma_a = \sigma_b$ is not rejected.
ii) On the other hand, if H0: $\sigma_a^2 = \sigma_b^2$ vs H1: $\sigma_a^2 \neq \sigma_b^2$, then we shall compare the calculated F with $F_{16,20,0.025}$ (= 2.46), and since the F value (= 1.46) < 2.46, the null hypothesis is not rejected.
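The variance-ratio calculation can be verified with the short sketch below. It follows the formula used in this example, in which each sample variance is scaled by n/(n − 1); the function name `variance_ratio` is hypothetical, and the critical value 2.18 is the table value quoted in the example.

```python
def variance_ratio(n_a, s_a, n_b, s_b):
    """F statistic as the ratio of the two estimated population variances n*S^2/(n-1)."""
    var_a = n_a * s_a ** 2 / (n_a - 1)
    var_b = n_b * s_b ** 2 / (n_b - 1)
    return var_a / var_b

f_value = variance_ratio(n_a=17, s_a=60, n_b=21, s_b=50)
f_critical = 2.18                  # F table value for (16, 20) d.f. at alpha = 0.05
reject_h0 = f_value > f_critical

print(f"F = {f_value:.2f}, reject H0: {reject_h0}")
```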
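where $a = \bar{Y} - b\bar{X}$ and $b = \dfrac{\sum (X-\bar{X})(Y-\bar{Y})}{\sum (X-\bar{X})^2}$ or, equivalently, $b = \dfrac{\sum XY - \frac{\sum X \sum Y}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}}$
The Derivation
From y = a + bx ….. (1)
Summing both sides of (1): $\sum y = na + b\sum x$ ….. (2)
Also multiplying both sides of (1) by x and summing: $\sum xy = a\sum x + b\sum x^2$ ….. (3)
From (2), $a = \dfrac{\sum y}{n} - \dfrac{b\sum x}{n}$. Substituting this into (3):
$\sum xy = \dfrac{\sum x \sum y}{n} - \dfrac{b(\sum x)^2}{n} + b\sum x^2$
$\sum xy - \dfrac{\sum x \sum y}{n} = b\sum x^2 - \dfrac{b(\sum x)^2}{n}$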
Note that a linear regression can be determined by plotting a scatter diagram of X against
Y and using the free hand method to draw the line of best fit. However, using the free
hand method does not usually give accurate results.
Worked Examples
1. If it is believed that the weight of a person is related to his height, determine the regression equation for a sample of 9 students whose heights and weights are as given below:
Height   1.3   1.4   1.4   1.5   1.5   1.5   1.6   1.6   1.7
Weight   130   135   145   132   150   168   165   180   170
Computation:
Student   X      Y      X²      XY
1         1.3    130    1.69    169
2         1.4    135    1.96    189
3         1.4    145    1.96    203
4         1.5    132    2.25    198
5         1.5    150    2.25    225
6         1.5    168    2.25    252
7         1.6    165    2.56    264
8         1.6    180    2.56    288
9         1.7    170    2.89    289
Total     13.5   1375   20.37   2077
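The totals in the computation table are all that is needed to obtain the regression coefficients. The sketch below applies the least squares formulas for a and b to the height and weight data; the function name `fit_line` is an assumption, and the printed coefficients are simply what these formulas give for the tabulated data.

```python
def fit_line(xs, ys):
    """Least squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = sum_y / n - b * sum_x / n      # a = mean(y) - b * mean(x)
    return a, b

heights = [1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.6, 1.7]
weights = [130, 135, 145, 132, 150, 168, 165, 180, 170]

a, b = fit_line(heights, weights)
print(f"weight = {a:.2f} + {b:.2f} * height")
```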
Computation:-
S/No   Gestational Age (weeks) x   Birth Weight (grams) y   x²   y²   xy
1 34.7 1895 1204.09 3591025 65756.5
2 36.0 2030 1296 4120900 73080
3 29.3 1440 858.49 2073600 42192
4 40.1 2835 1608.01 8037225 113683.5
5 35.7 3090 1274.49 9548100 110313
6 42.4 3827 1797.76 14645929 162264.8
7 40.3 3260 1624.09 10627600 131378.0
8 37.3 2690 1391.29 7236100 100337
9 40.9 3285 1672.81 10791225 134356.5
10 38.3 2920 1466.89 8526400 111836
11 38.5 3430 1482.25 11764900 132005
12 41.4 3657 1713.96 13373649 151399.8
13 39.7 3685 1576.09 13579225 146294.5
14 39.7 3345 1576.09 11189025 132796.5
15 41.1 3260 1689.21 10627600 133986
16 38.0 2680 1444 7182400 101840
17 38.7 2005 1497.69 4020025 77593.5
Total 652.1 49334 25173.21 150934928 1921112.6
$\bar{X} = \dfrac{652.1}{17} = 38.36$
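The sample correlation coefficient is
$r = \dfrac{\frac{1}{n-1}\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\frac{1}{n-1}\sum (x-\bar{x})^2}\;\sqrt{\frac{1}{n-1}\sum (y-\bar{y})^2}} = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$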
Worked Example:
In the example 2 on gestation period and birth weight given above, calculate the sample
correlation coefficient and draw appropriate conclusion on the association between gestation
period and infant birth weight.
Computation:
$r = \dfrac{17(1921112.6) - (652.1)(49334)}{\sqrt{\left\{17(25173.21) - (652.1)^2\right\}\left\{17(150934928) - (49334)^2\right\}}} = 0.82$
Conclusion:- The value of r is positive and close to 1 which implies that there is a strong
relationship between gestation period and birth weight. The longer the gestation period the
more the baby’s birth weight.
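The value of r can be reproduced from the column totals of the computation table. The sketch below uses the computational form of the correlation formula; the function name `correlation_from_totals` is hypothetical, and the totals are those listed in the table for the 17 infants.

```python
import math

def correlation_from_totals(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Sample correlation coefficient computed from the summary totals."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

r = correlation_from_totals(
    n=17,
    sum_x=652.1,           # total gestational age (weeks)
    sum_y=49334,           # total birth weight (grams)
    sum_xx=25173.21,
    sum_yy=150934928,
    sum_xy=1921112.6,
)
print(f"r = {r:.2f}")
```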
Rank Correlation
There are cases where a dependency between 2 variables x and y can be observed but the
distribution is not known. If this is so, then methods of calculating r as before cannot be used.
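$r_s = 1 - \dfrac{6\sum d^2}{n(n^2-1)}$, where d is the difference between the ranks of each pair of observations and n is the number of pairs.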
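$\hat{b}_1 = \dfrac{(\sum x_1 y)(\sum x_2^2) - (\sum x_2 y)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}$
$\hat{b}_2 = \dfrac{(\sum x_2 y)(\sum x_1^2) - (\sum x_1 y)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}$
$\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X}_1 - \hat{b}_2 \bar{X}_2$
where $x_1 = X_1 - \bar{X}_1$
and $x_2 = X_2 - \bar{X}_2$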