PGD Sta - Sta 703

STATISTICAL INFERENCE
Sampling Distribution of the Mean

When samples are drawn from a population, the sample mean differs from sample to sample. The probability distribution of these sample means is called the sampling distribution of the mean.
The sampling distribution has a mean $\mu_{\bar{x}}$ and a standard deviation $\sigma_{\bar{x}}$, which is called the standard error.
$E(\bar{X}) = \mu_{\bar{x}} = \mu$, i.e. if we take repeated samples, the mean of the means of these samples will be equal to the population mean.
Also $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$, where σ = population standard deviation and n = sample size,
or $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}$ when the population size N is finite and n ≥ 0.05N.
As n increases, i.e. as n → ∞, the sampling distribution of the mean approaches normal irrespective of the population distribution. The Central Limit Theorem states that when n is large (n ≥ 30 is the usual rule of thumb), the sampling distribution of the mean is approximately normal.
Example 1: Find the sampling distribution of the mean of a sample of size 36 if the
sample was drawn from a population size of 900 whose mean is 20 and standard
deviation is 12.
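Solution
Check whether n ≥ 0.05N:
n = 36, N = 900
0.05 x 900 = 45 and n = 36 < 45, so no finite population correction is needed.
∴ The sampling distribution of the mean has mean $\mu_{\bar{x}} = \mu = 20$ and
standard deviation $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{12}{\sqrt{36}} = 2$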
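Example 2: Find the sampling distribution if the sample size in Example 1 is 81.
Solution
n = 81, 0.05 x 900 = 45, so n > 45 and the finite population correction applies.
∴ The sampling distribution has mean $\mu_{\bar{x}} = \mu = 20$ and
$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}} = \dfrac{12}{\sqrt{81}}\sqrt{\dfrac{900-81}{900-1}} = \dfrac{12}{9}\sqrt{\dfrac{819}{899}} = 1.3333\sqrt{0.9110} = 1.3333 \times 0.9545 \approx 1.27$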
Example 3: Find the probability that the mean $\bar{X}$ of a random sample of size 36 taken from the population in Example 1 lies between 18 and 24.
$Z_1 = \dfrac{18-20}{\sigma/\sqrt{n}} = \dfrac{18-20}{2} = -1$
$Z_2 = \dfrac{24-20}{2} = 2$
$P(18 < \bar{X} < 24) = P(-1 < Z < 2) = P(Z < 2) - P(Z < -1) = 0.9772 - 0.1587 = 0.8185$
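The three examples above can be checked numerically. The following is a minimal sketch, not part of the original notes: the function names `standard_error` and `normal_cdf` are illustrative, and the normal probability is computed from the error function instead of being read from a table, so the last digit may differ slightly from the table-based answer.

```python
import math

def standard_error(sigma, n, N=None):
    """Standard error of the mean; applies the finite population
    correction when N is given and n >= 0.05 * N."""
    se = sigma / math.sqrt(n)
    if N is not None and n >= 0.05 * N:
        se *= math.sqrt((N - n) / (N - 1))
    return se

def normal_cdf(z):
    """P(Z < z) for a standard normal variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

se_36 = standard_error(12, 36, N=900)   # Example 1: no correction (36 < 45)
se_81 = standard_error(12, 81, N=900)   # Example 2: correction applied
# Example 3: probability that the sample mean lies between 18 and 24
prob = normal_cdf((24 - 20) / se_36) - normal_cdf((18 - 20) / se_36)
print(round(se_36, 4), round(se_81, 4), round(prob, 4))
```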
Estimation Using Normal Distribution
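Point Estimate is a single value used to estimate a population parameter.
The point estimate is unbiased if the mean of repeated samples from a population is the mean of the population, i.e. $E(\bar{X}) = \mu$.
Interval Estimate is a range of values, together with the probability (or confidence level) that the interval includes the unknown population parameter. If we are given a confidence level of 1 − α, we can then obtain the confidence interval as:
$P(\bar{X} - Z_{1-\alpha}\,\sigma_{\bar{x}} < \mu < \bar{X} + Z_{1-\alpha}\,\sigma_{\bar{x}}) = 1 - \alpha$
where $Z_{1-\alpha}$ denotes the two-sided critical value of the standard normal distribution for confidence level 1 − α (e.g. 1.96 for 95%).
Since $Z = \dfrac{\bar{X}-\mu}{\sigma_{\bar{x}}}$, then $\mu = \bar{X} \pm Z_{1-\alpha}\,\sigma_{\bar{x}}$.
For proportion: $Z = \dfrac{\hat{p}-p}{\sqrt{\dfrac{p(1-p)}{n}}}$, assuming n < 0.05N, where $\hat{p}$ is the sample proportion.
The number of successes follows a Binomial distribution; however, when n is large, with both np > 5 and n(1 − p) > 5, the distribution tends to normal.
The confidence interval for a proportion is given by:
$p = \hat{p} \pm Z_{1-\alpha}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$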
Confidence intervals for the mean using t- Distribution
When a population is normally distributed but its standard deviation σ is unknown and n < 30, the confidence interval cannot be determined using the normal distribution; the t-distribution is used instead.
The confidence interval of the mean at level α is $\bar{X} \pm t_{1-\alpha}\,\dfrac{s}{\sqrt{n}}$, with n − 1 degrees of freedom.
Here $\sigma_{\bar{x}}$ is estimated using $\dfrac{s}{\sqrt{n}}$, where s = sample standard deviation.
Chebyshev’s Theorem
When the sample size n < 30 and the population is not normally distributed, we cannot use the normal distribution or the t-distribution to determine the confidence interval. Chebyshev's theorem is then applied.
Given that the level of significance is α, the confidence level is 1 − α.
According to Chebyshev's theorem, $1 - \alpha = 1 - \dfrac{1}{k^2}$. This implies that $\dfrac{1}{k^2} = \alpha$,
∴ $k^2 = \dfrac{1}{\alpha}$ and $k = \sqrt{\dfrac{1}{\alpha}}$
Therefore $\mu = \bar{x} \pm k\,\dfrac{\sigma}{\sqrt{n}}$
Note: Chebyshev's theorem is not often used because the resulting confidence interval is usually wide.
Example I: A random sample of size n = 25 has a mean $\bar{X}$ = 80. The sample was taken from a population of size 1000 whose standard deviation is known to be σ = 30. Given that the population is not normally distributed, find the 95% confidence interval for the unknown population mean.
1 − α = 0.95
$1 - \dfrac{1}{k^2} = 0.95$, so $\dfrac{1}{k^2} = 0.05$ and $k^2 = \dfrac{1}{0.05} = 20$
$k = \sqrt{20} = 4.47$
The 95% confidence interval for μ is $\bar{X} \pm k\,\dfrac{\sigma}{\sqrt{n}}$:
$\mu = 80 \pm \dfrac{4.47 \times 30}{\sqrt{25}}$
= 80 ± 26.82, i.e. from 53.18 to 106.82
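The Chebyshev calculation above can be sketched in a few lines. This is an illustrative helper (the name `chebyshev_interval` is not from the notes); it keeps k unrounded, so the half-width comes out near 26.83 and not the hand-rounded 26.82.

```python
import math

def chebyshev_interval(mean, sigma, n, confidence):
    """Distribution-free interval mean ± k*sigma/sqrt(n), with k = sqrt(1/alpha)."""
    alpha = 1.0 - confidence
    k = math.sqrt(1.0 / alpha)
    half_width = k * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width

low, high = chebyshev_interval(mean=80, sigma=30, n=25, confidence=0.95)
print(round(low, 2), round(high, 2))
```

For comparison, the normal-theory half-width for the same data would be 1.96 x 30/5 = 11.76, which illustrates why the Chebyshev interval is described as wide.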
Example II: A pharmacy wants to estimate, with 90% confidence and to within ±0.06, the actual proportion of buyers who prefer a particular brand of pain relief drug to some other brand. Past records show that 30% of buyers choose the particular brand. What is the minimum sample size to be taken for the estimate?
$Z = \dfrac{\hat{p}-p}{\sigma_p} = \dfrac{0.06}{\sigma_p}$
so $Z\,\sigma_p = 0.06$ and $\sigma_p = \dfrac{0.06}{Z}$
$Z_{0.90} = 1.64$
$\sigma_p = \dfrac{0.06}{1.64} = 0.0366$
But $\sigma_p = \sqrt{\dfrac{p(1-p)}{n}}$, so $\sigma_p^2 = \dfrac{p(1-p)}{n}$ ∴ $n = \dfrac{p(1-p)}{\sigma_p^2}$
$n = \dfrac{0.3 \times 0.7}{(0.0366)^2} \approx 157$
Example III: Obtain the 95% confidence interval for the mean of a population of size 800 if a random sample of size 64, taken from the population, has a mean $\bar{X}$ = 50 and standard deviation S = 20.
Solution
Since n > 30, we can assume a normal distribution.
n = 64, N = 800
0.05N = 40, so n = 64 > 0.05N and the finite population correction applies.
$\sigma_{\bar{x}} = \dfrac{S}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}} = \dfrac{20}{\sqrt{64}}\sqrt{\dfrac{800-64}{800-1}} = 2.4$
$\mu = \bar{X} \pm Z_{1-\alpha}\,\sigma_{\bar{x}} = 50 \pm 1.96 \times 2.4 = 50 \pm 4.7$

Example IV: A random sample of size n = 25 has a mean of 80. The sample was taken from a normally distributed population of size 1000 whose standard deviation is 30. Calculate the 90% confidence interval.
Solution
N = 1000, n = 25, 0.05N = 0.05 x 1000 = 50
n = 25 < 0.05N, so no finite population correction is needed.
∴ $\mu = \bar{X} \pm Z_{1-\alpha}\,\dfrac{\sigma}{\sqrt{n}} = 80 \pm 1.64 \times \dfrac{30}{\sqrt{25}}$
= 80 ± 9.84
Example V: A random sample of 36 students was taken out of 500 students who took an IQ test. The mean score for the 36 students is 380. If the population standard deviation is 40, find the 95% confidence interval of the population mean.
n = 36, N = 500, 0.05N = 0.05 x 500 = 25
Since n > 0.05N, $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}} = \dfrac{40}{\sqrt{36}}\sqrt{\dfrac{500-36}{500-1}} \approx 6.43$
∴ $\mu = 380 \pm 1.96 \times 6.43 \approx 380 \pm 12.6$
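Examples III to V all follow the same recipe: compute the standard error, apply the finite population correction when n ≥ 0.05N, and multiply by the table value of Z. The sketch below is one way to express that recipe; the function name and argument names are illustrative, and the Z values are the table values used in the examples.

```python
import math

def mean_confidence_interval(xbar, sd, n, z, N=None):
    """Normal-theory interval xbar ± z * SE, with the finite population
    correction applied when N is given and n >= 0.05 * N."""
    se = sd / math.sqrt(n)
    if N is not None and n >= 0.05 * N:
        se *= math.sqrt((N - n) / (N - 1))
    margin = z * se
    return xbar - margin, xbar + margin

ci_3 = mean_confidence_interval(50, 20, 64, z=1.96, N=800)    # Example III
ci_4 = mean_confidence_interval(80, 30, 25, z=1.64, N=1000)   # Example IV
ci_5 = mean_confidence_interval(380, 40, 36, z=1.96, N=500)   # Example V
print(ci_3, ci_4, ci_5)
```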
HYPOTHESES TESTING
Hypothesis testing is one of the fundamental aspects of statistical inference and analysis. It is concerned with testing an assumption about a population parameter based on the result obtained from a sample drawn from the population.
Definitions:
i) Type I error: rejecting a null hypothesis that is true.
ii) Type II error: accepting (failing to reject) a null hypothesis that is false.
iii) One-tail test: only one extreme end of the distribution is considered.
iv) Two-tail test: the two extreme ends of the distribution are considered.
v) Null hypothesis: the null hypothesis is based on the premise of no difference, no association or no effect. It can be likened to the principle that an accused person is not guilty until proven otherwise.
vi) Alternative hypothesis: usually reflects the investigator's belief or what is expected from the research study.
TEST OF HYPOTHESES
1. State the null hypothesis H0 and the alternative hypothesis H1.
2. Determine the level of significance for the test.
3. Determine the test statistic.
4. State the decision rule.
5. Calculate the test statistic based on values got from the sample.
6. Make a decision.
7. Conclude.
Test of Hypotheses About Population Means And Proportion
Example l: A medical researcher claims that the average lifetime expectancy of women
in Nigeria is 72 years. Test this claim if a sample of 100 women taken from obituaries of
women who had died show that the average age of the dead women was 69 years with a
standard deviation of 15 years. Use a level of significance α = 0.05.
Hypotheses: H0: µ = 72 H1: µ ≠ 72
Level of significance: α = 0.05
Test statistic: $Z = \dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}$ or $Z = \dfrac{\bar{X}-\mu}{s/\sqrt{n}}$ if n > 30
Decision rule: Accept H0 if $-Z_{1-\alpha} < Z < Z_{1-\alpha}$; otherwise reject H0.
Calculation: Since n > 30, we shall use s as an estimate of σ.
µ = 72, $\bar{X}$ = 69, n = 100
$Z = \dfrac{\bar{X}-\mu}{s/\sqrt{n}} = \dfrac{69-72}{15/\sqrt{100}} = -\dfrac{3}{15/10} = -\dfrac{3}{1.5} = -2.0$
From the table: the critical values are ±1.96, so the acceptance region is −1.96 < Z < 1.96 and the rejection regions are the two tails beyond these values.
Decision: Since the calculated Z-value (−2.0) falls in the rejection region, we shall not accept H0.
Conclusion: Based on the available data, the conclusion is that the life expectancy of women in Nigeria is not 72 years.
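The seven steps listed under TEST OF HYPOTHESES map directly onto a short computation. The sketch below reproduces the life expectancy example; `one_sample_z` is an illustrative name, and the critical value 1.96 is the two-tail table value for α = 0.05 used in the example.

```python
import math

def one_sample_z(xbar, mu0, sd, n):
    """z statistic for H0: mu = mu0, using sd/sqrt(n) as the standard error."""
    return (xbar - mu0) / (sd / math.sqrt(n))

z = one_sample_z(xbar=69, mu0=72, sd=15, n=100)
z_crit = 1.96                      # two-tail table value for alpha = 0.05
reject_h0 = abs(z) > z_crit        # decision rule for a two-tail test
print(z, reject_h0)
```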
Example ll A poultry farmer claims that over the years 80% of his broiler chicks survive
up to 10 weeks. In a batch of 36 day old broilers that he reared, only 25 of them survived
up to 10 weeks. Based on this observation, test whether that batch of broilers performed
worse than the previous batches. Use a level of significance α = 0.05
Solution
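Hypotheses: H0: p ≥ 0.80 H1: p < 0.80
Level of significance: α = 0.05
Test statistic: $Z = \dfrac{\hat{p}-p}{\sigma_p} = \dfrac{\hat{p}-p}{\sqrt{\dfrac{p(1-p)}{n}}}$
Decision rule: Reject H0 if the calculated Z < −Z1-α (= −1.645)
Calculation: $\hat{p} = \dfrac{25}{36} = 0.6944$
$Z = \dfrac{0.6944-0.80}{\sqrt{\dfrac{0.80 \times 0.20}{36}}} = -\dfrac{0.1056}{\sqrt{0.16/36}} = -\dfrac{0.1056}{0.0667} = -1.58$
(If $\hat{p}$ is first rounded to 0.69, the statistic becomes −0.11/0.0667 = −1.65, which lies just beyond the critical value; the unrounded value does not.)
One-tail test: the critical region lies to the left, so the critical value is −1.645.
Decision: Since the calculated Z (−1.58) > −1.645, H0 is not rejected.
Conclusion: Based on the analysis of the available data, there is not sufficient evidence at the 5% level to conclude that this particular batch of broilers performed worse than the previous batches, although the result is borderline.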
Example lll: (small sample)
A hospital management wants to know, with a 90% level of confidence, whether the 50cl bottles of syrup actually contain 50cl of syrup. From past records it is known that the volume of syrup in the bottles is normally distributed. The hospital store officer took a random sample of 16 bottles and found that $\bar{X}$ = 52cl with standard deviation s = 7.5cl.
Solution
Hypotheses: H0: µ = 50 H1: µ ≠ 50. This is a two-tail test with α = 0.10, so each tail has area 0.05.
Test statistic: $t = \dfrac{\bar{X}-\mu}{s/\sqrt{n}}$ with n − 1 = 15 degrees of freedom
Decision rule: Accept H0 if the calculated value of t falls within the acceptance region, i.e. if −t15,0.05 < t < t15,0.05
Calculation: $t = \dfrac{\bar{X}-\mu}{s/\sqrt{n}} = \dfrac{52-50}{7.5/\sqrt{16}} = \dfrac{2}{7.5/4} = \dfrac{2}{1.875} = 1.0667$
From the t-table, t15,0.05 = 1.753
DECISION: Since the calculated t-value (= 1.0667) < 1.753 (i.e. it falls within the acceptance region), we shall not reject the null hypothesis H0.
CONCLUSION: At the 90% confidence level, the data give the hospital management no reason to doubt that the volume of syrup in the bottles is 50cl.
Example IV (proportion and small sample): Suppose that in Example II the number of broiler chicks in the batch was not 36 but 16, out of which only 12 survived to 10 weeks. The solution is then as given below.
Solution:
Hypotheses: H0: p = 0.80 H1: p < 0.80
Level of significance: α = 0.05
Decision rule: Accept H0 if the calculated t > −t15,0.05 (= −1.753)
Calculation:
$\hat{p} = \dfrac{12}{16} = \dfrac{3}{4} = 0.75$
$\sigma_p = \sqrt{\dfrac{0.8 \times 0.2}{16}} = \sqrt{\dfrac{0.16}{16}} = \sqrt{0.01} = 0.1$
$t = \dfrac{\hat{p}-p}{\sigma_p} = \dfrac{0.75-0.8}{0.1} = -\dfrac{0.05}{0.1} = -0.5$
−t15,0.05 = −1.753
Decision:
Since the calculated t-value (−0.5) > −1.753, we shall not reject H0 because the t-value falls within the acceptance region.
Conclusion: Based on the analysis of the available data, there is no sufficient evidence against the poultry farmer's claim.
Test of Hypotheses for differences between two means or proportions
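Standard errors:
$\sigma_{\bar{x}_1-\bar{x}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}}$ or $\sqrt{\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}}$
$\sigma_{p_1-p_2} = \sqrt{\dfrac{p(1-p)}{n_1}+\dfrac{p(1-p)}{n_2}}$
where $p = \dfrac{n_1p_1+n_2p_2}{n_1+n_2}$, i.e. p = weighted average of p1 and p2
Test Statistic
For large n, $Z = \dfrac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}}}$ or $Z = \dfrac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}}}$
or $Z = \dfrac{p_1-p_2}{\sqrt{\dfrac{p(1-p)}{n_1}+\dfrac{p(1-p)}{n_2}}}$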
Example 1
Researchers conducted a study to determine whether magnets are effective in treating back pain. Pain was measured using the visual analog scale and the coded results, given below, were obtained from a pilot study.
Reduction in pain level after magnet treatment: n = 54, $\bar{X}$ = 54, S = 18
Reduction in pain level after placebo treatment: n = 40, $\bar{X}$ = 60, S = 20
Based on the result, test at the 5% level of significance whether the magnet treatment is effective.
Solution
Hypotheses: H0: µ1 = µ2 H1: µ1 < µ2
Level of significance: α = 0.05
Test statistic: $Z = \dfrac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}}}$
Decision rule: Reject H0 if the Z value < −Z1-α (= −1.645)
Calculation (the coded values are divided by 100; this rescaling does not change Z):
n1 = 54, $\bar{x}_1$ = 0.54, s1 = 0.18
n2 = 40, $\bar{x}_2$ = 0.60, s2 = 0.20
$Z \cong \dfrac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}}}$ since the sample sizes are large
$= \dfrac{0.54-0.60}{\sqrt{\dfrac{(0.18)^2}{54}+\dfrac{(0.20)^2}{40}}} = \dfrac{-0.06}{\sqrt{0.0006+0.001}} = \dfrac{-0.06}{\sqrt{0.0016}} = -\dfrac{0.06}{0.04} = -1.5$
Critical value: −Z1-α = −1.645
Decision: Since the Z value (= −1.5) > −1.645, the null hypothesis is not rejected.
Conclusion: Based on the results from the pilot study, there is no sufficient evidence of a difference between the magnet treatment and the placebo treatment.
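As a check on the arithmetic in the magnet example, the large-sample statistic for two means can be sketched as follows. The function name `two_sample_z` is illustrative, and −1.645 is the left-tail table value for α = 0.05.

```python
import math

def two_sample_z(x1, s1, n1, x2, s2, n2):
    """Large-sample z statistic for H0: mu1 = mu2."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (x1 - x2) / se

z = two_sample_z(0.54, 0.18, 54, 0.60, 0.20, 40)
reject_h0 = z < -1.645             # left-tail table value for alpha = 0.05
print(round(z, 2), reject_h0)
```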
Example 2:
A hospital wanted to know whether two brands of surgical gloves respond the same way to stress. Among 100 gloves taken from Brand 1, 9% leaked under stress; of the 80 gloves taken from Brand 2, 7% leaked under stress.
Test the claim that there is no difference in the leakage rate for the two brands. Use a level of significance α = 0.10
Hypotheses: H0: p1 = p2 H1: p1 ≠ p2
Level of significance: α = 0.10
Test statistic: $Z = \dfrac{p_1-p_2}{\sqrt{\dfrac{p(1-p)}{n_1}+\dfrac{p(1-p)}{n_2}}}$ where $p = \dfrac{n_1p_1+n_2p_2}{n_1+n_2}$
Decision rule: Do not reject H0 if $-Z_{\alpha/2} < Z < Z_{\alpha/2}$, i.e. −Z0.05 < Z < Z0.05; otherwise reject H0.
Calculation: n1 = 100, p1 = 0.09, n2 = 80, p2 = 0.07
$p = \dfrac{100 \times 0.09 + 80 \times 0.07}{100+80} = \dfrac{14.6}{180} = 0.081$
$Z = \dfrac{0.09-0.07}{\sqrt{\dfrac{(0.081)(0.919)}{100}+\dfrac{(0.081)(0.919)}{80}}} = \dfrac{0.02}{\sqrt{0.000744+0.000930}} = \dfrac{0.02}{\sqrt{0.001675}} = \dfrac{0.02}{0.0409} = 0.49$
$Z_{\alpha/2} = Z_{0.05} = 1.645$
Decision: Since the Z value (0.49) < $Z_{\alpha/2}$ (1.645), the null hypothesis H0 will not be rejected.
Conclusion: Based on the analysis of the available data, it will be concluded that there is no significant difference in the leakage rates of the two brands of gloves.
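The pooled-proportion test used for the gloves can be sketched in the same way. The function name is illustrative; the code keeps the pooled proportion unrounded, so the statistic may differ in the third decimal place from a hand calculation that rounds p first.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for H0: p1 = p2 using the pooled (weighted average) proportion."""
    p = (n1 * p1 + n2 * p2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
    return (p1 - p2) / se

z = two_proportion_z(0.09, 100, 0.07, 80)
reject_h0 = abs(z) > 1.645         # two-tail table value for alpha = 0.10
print(round(z, 2), reject_h0)
```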
ANALYSIS OF CATEGORICAL DATA
Categorical (or qualitative or attribute) data are data that can be separated into different categories (called cells) that are distinguished by non-numerical characteristics. Experiments that result in categorical data are multinomial. A multinomial experiment can be likened to a binomial experiment, except that a binomial experiment has two categories whereas a multinomial experiment has more than two.
Conditions for multinomial experiments
i. The number of trials is fixed
ii. The trials are independent
iii. All outcomes of each trial must be classified into exactly one of several
different categories
iv. The probabilities for the different categories remain constant for each trial
THE CHI-SQUARE DISTRIBUTION
If independent samples of size n are selected at random from a normally distributed population with variance σ², the sample variance s² obtained from the samples gives the statistic $\chi^2 = \dfrac{(n-1)s^2}{\sigma^2}$. This statistic is said to be Chi-square distributed, with n − 1 degrees of freedom.
The Chi-square statistic is used in the analysis of categorical data
Properties of Chi-square Distribution
- Unlike the normal and t distributions, the Chi-square distribution is not symmetric
- Values of Chi-square statistic can be zero or positive but not negative
- The Chi-square distribution is dependent on number of degrees of freedom
- As number of degrees of freedom increases, the chi-square distribution
approaches the normal distribution
Uses of the Chi-square distribution
(i) To test goodness of fit – to test the hypothesis that an observed frequency
distribution fits some claimed distribution.
(ii) To test for independency in a contingency table –to test that there is no
association between row variables and column variables in a contingency
table.
(iii) To test for homogeneity between samples –to test the claim that different
populations have the same proportions of some characteristics.
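The probability density function of a random variable Y which is χ² distributed with n degrees of freedom is given as
$f(y) = \dfrac{1}{\left(\frac{n}{2}-1\right)!}\,2^{-n/2}\,y^{(n/2)-1}\,e^{-y/2}$, for y > 0,
where $\left(\frac{n}{2}-1\right)!$ is understood as Γ(n/2) when n is odd.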
Requirements for the use of the χ2 statistic

i) The data should be randomly selected
ii) The sample data should consist of frequency counts for each of the different
categories
iii) The sample data should come from a multinomial experiment.
iv) For each category, the expected frequency should not be less than 5. It is not
necessary for observed frequency to be more than 5.
Yates correction for continuity
The Chi-square distribution is continuous, but the test statistic χ² is used for discrete variables in the contingency table. For a 2 x 2 contingency table, Yates' correction for continuity is sometimes used in cells with an expected frequency of less than 10 (i.e. E < 10). With Yates' correction,
$\chi^2 = \sum \dfrac{(|O-E|-0.5)^2}{E}$
1) Test for goodness of fit
Procedure:
i) State the null hypothesis H0. H0 is usually stated as: the sampling distribution agrees with the hypothetical (theoretical) distribution.
ii) Determine the level of significance α.
iii) The test statistic is $\chi^2 = \sum \dfrac{(O-E)^2}{E}$
where O represents the observed frequency of an outcome
and E represents the expected frequency of an outcome.
iv) The decision rule is to reject H0 if the calculated χ² value > $\chi^2_{v,\alpha}$, where v = degrees of freedom and α is the level of significance. For a goodness-of-fit test with k categories, v = k − 1.
If all expected frequencies are equal then $E = \dfrac{n}{k}$, otherwise E = np for each category,
where n = total number of observations
k = number of different categories
p = claimed proportion of observations in each category
Example 1
The table below shows the total number of fatalities from motorbike accidents for a
period of one year on different days of the week.
Day Sun Mon Tue Wed Thu Fri. Sat
Number of fatalities 30 20 18 21 22 29 35
The sample of motorbike accidents was randomly selected. Test the claim that accidents
occur with equal frequency on the different days. Use a level of significance α = 0.05
Solution
Hypotheses: H0: the number of fatalities is the same on all the days, m1 = m2 = m3 = … = m7
H1: the number of fatalities is not the same for at least one of the days
Level of significance: α = 0.05
Test statistic: $\chi^2 = \sum \dfrac{(O-E)^2}{E}$
Decision rule: Reject H0 if the χ² value > $\chi^2_{6,0.05}$ = 12.592
Calculation
$E = \dfrac{n}{k} = \dfrac{175}{7} = 25$
Expected number for each day = 25
Day      Observed O   Expected E   O − E   (O − E)²   (O − E)²/E
Sun          30           25          5       25         1.00
Mon          20           25         −5       25         1.00
Tue          18           25         −7       49         1.96
Wed          21           25         −4       16         0.64
Thu          22           25         −3        9         0.36
Fri          29           25          4       16         0.64
Sat          35           25         10      100         4.00
Total       175          175                             9.60
$\chi^2_{6,0.05}$ = 12.592, χ² value = 9.60
Decision: Since the χ² value (= 9.60) < $\chi^2_{6,0.05}$ (= 12.592), the null hypothesis will not be rejected.
Conclusion: Based on the analysis of the sample data, there is no sufficient basis to conclude that the number of fatalities differs significantly between the days of the week.
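The goodness-of-fit statistic for the fatalities data can be reproduced with a short function. This is a sketch with an illustrative function name; the critical value 12.592 is the table value for 6 degrees of freedom at α = 0.05 quoted in the example.

```python
def chi_square_gof(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

fatalities = [30, 20, 18, 21, 22, 29, 35]          # Sun .. Sat
n, k = sum(fatalities), len(fatalities)
chi2 = chi_square_gof(fatalities, [n / k] * k)      # equal expected counts n/k
reject_h0 = chi2 > 12.592                           # table value, 6 d.f., alpha = 0.05
print(round(chi2, 2), reject_h0)
```

When the claimed proportions are not equal, the expected list would instead be built as n times each claimed proportion.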
Example II
In Genetics, it is claimed that if a couple are both AS genotype, then their offspring could
be AA, AS or SS with probabilities 0.25, 0.50 and 0.25 respectively. The data below lists
the genotype frequencies of 180 randomly selected offspring from couples that are both
AS genotype.
Genotype      AA    AS    SS
Frequencies   44   100    36
Test the genetics claim at a level of significance α = 0.10

Solution
Hypotheses:
H0: p1 = 0.25, p2 = 0.50, p3 = 0.25
H1: At least one of the proportions is different from the claimed value
Test statistic: $\chi^2 = \sum \dfrac{(O-E)^2}{E}$
Decision rule: Reject H0 if the χ² value > $\chi^2_{2,0.10}$ (= 4.605)
Calculation
Genotype   Observed frequency O   Expected frequency E   (O − E)²/E
AA                 44              0.25 x 180 = 45         0.0222
AS                100              0.50 x 180 = 90         1.1111
SS                 36              0.25 x 180 = 45         1.8000
Total             180                         180          2.9333
χ² value = 2.9333, $\chi^2_{2,0.10}$ = 4.605
Decision: Since the χ² value (= 2.9333) < $\chi^2_{2,0.10}$ (= 4.605), the null hypothesis H0 will not be rejected.
Conclusion: Based on the analysis of the data, there is no sufficient evidence to reject the genetics claim.
2) Test of independence: Use of the Contingency Table
In the test of independence the concern is to find whether there is a dependency relationship between two or more attributes. The population and sample are classified according to several attributes, and the test of interest is whether there is any dependency relationship between the attributes. The test does not indicate the degree of the dependency.
Expected value in each cell = (Row total x Column total) / Overall total
Example 1: A study was conducted to test whether smoking has any relationship with gender.
A random sample of 100 persons was selected and the distribution is as shown:
              Male        Female      Total
Smoker         30          10         n1. = 40
Non-smoker     20          40         n2. = 60
Total       n.1 = 50    n.2 = 50      n.. = 100
Expected frequencies:
$e_{11} = \dfrac{n_{1.} \times n_{.1}}{n_{..}} = \dfrac{40 \times 50}{100} = 20$
$e_{12} = \dfrac{n_{1.} \times n_{.2}}{n_{..}} = \dfrac{40 \times 50}{100} = 20$
Similarly e21 = 30 and e22 = 30
$\chi^2 = \sum_{i}\sum_{j} \dfrac{(o_{ij}-e_{ij})^2}{e_{ij}} = \dfrac{(30-20)^2}{20} + \dfrac{(10-20)^2}{20} + \dfrac{(20-30)^2}{30} + \dfrac{(40-30)^2}{30} = 16.67$
Degrees of freedom = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1, $\chi^2_{1,0.05}$ = 3.84
Since the degree of freedom is 1, we may use Yates' adjustment:
$\chi^2 = \dfrac{(|30-20|-0.5)^2}{20} + \dfrac{(|10-20|-0.5)^2}{20} + \dfrac{(|20-30|-0.5)^2}{30} + \dfrac{(|40-30|-0.5)^2}{30} = 15.04$
Decision: Since the χ² value (= 15.04) > $\chi^2_{1,0.05}$ (= 3.84), the null hypothesis H0 is rejected.
Conclusion: Smoking is gender dependent.
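The expected counts and both versions of the statistic in the smoking example can be computed for any r x c table of counts. The function below is a sketch with illustrative names; the `yates` flag applies the 0.5 continuity correction to every cell, as in the hand calculation above.

```python
def chi_square_independence(table, yates=False):
    """Chi-square statistic for an r x c table of observed counts.
    Expected count = row total * column total / overall total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)   # continuity correction
            chi2 += diff ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

smoking = [[30, 10], [20, 40]]   # rows: smoker, non-smoker; columns: male, female
plain = chi_square_independence(smoking)
corrected = chi_square_independence(smoking, yates=True)
print(plain, corrected)
```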
Example 2 (Assignment)
To study whether there is a relationship between gender and colour preference, a random
sample of 100 persons selected shows the following distribution:
Gender
Colour Male Female
Pink 10 20
White 20 10
Blue 30 10
Total 60 40
Test the claim that there is no association between colour preference and gender. Use a level of significance α = 0.05
3) Test of homogeneity
This is a test to show that there is no difference between two or more samples which
means that samples are from similar or the same distribution. This test is referred to as
test of homogeneity in statistics.
The Null hypothesis is that the samples come from same population (or population of
same distribution).
Example I
Two types of test (PET/CT and MRI) were used to identify the stage of tumor in cancer patients. 50 cancer patients were tested with the two tests and the results of the diagnosis are as given below:
                        PET/CT
                   Correct   Incorrect
MRI   Correct         36         1
      Incorrect       11         2
Test the claim that there is no difference in the accuracy of the tests. Use a level of
significance α = 0.05
Solution
Calculation: This example is a test of homogeneity.
                        PET/CT
                   Correct       Incorrect     Total
MRI   Correct      36 (34.78)    1 (2.22)       37
      Incorrect    11 (12.22)    2 (0.78)       13
      Total        47            3              50
Expected values E are the figures in brackets.
$E_{ij} = \dfrac{n_{i.} \times n_{.j}}{n_{..}}$, e.g. $e_{11} = \dfrac{n_{1.} \times n_{.1}}{\text{total}} = \dfrac{37 \times 47}{50} = 34.78$
$e_{21} = \dfrac{n_{2.} \times n_{.1}}{\text{total}} = \dfrac{13 \times 47}{50} = 12.22$
Similarly $e_{12} = \dfrac{37 \times 3}{50} = 2.22$ and $e_{22} = \dfrac{13 \times 3}{50} = 0.78$
$\chi^2 = \dfrac{(36-34.78)^2}{34.78} + \dfrac{(11-12.22)^2}{12.22} + \dfrac{(1-2.22)^2}{2.22} + \dfrac{(2-0.78)^2}{0.78}$
= 0.0428 + 0.1218 + 0.6705 + 1.9082 = 2.7433
Degrees of freedom = (2 − 1)(2 − 1) = 1 x 1 = 1, $\chi^2_{1,0.05}$ = 3.841
Decision
Since the χ² value (= 2.7433) < $\chi^2_{1,0.05}$ (= 3.841), the null hypothesis will not be rejected.
Conclusion: Based on the analysis of the available data, it can be concluded that there is no significant difference between diagnoses made using the two tests.
Note: Before computing χ², Yates' correction can be made for cells with expected count (E) < 2.
Example II
Two different schools were to be compared in terms of their performance. A sample of 100 students was taken from each school and a test was administered to them. The grades obtained are as given below:
Grade    School 1   School 2
A           10         10
B           20         10
C           30         40
D           20         30
F           20         10
Total      100        100
Is there any difference between the grades of the two schools?
Observed counts are shown with expected counts in brackets:
Grade    School I    School II    (o − e)²/e, School I    (o − e)²/e, School II
A        10 (10)     10 (10)            0                      0
B        20 (15)     10 (15)            1.67                   1.67
C        30 (35)     40 (35)            0.71                   0.71
D        20 (25)     30 (25)            1.00                   1.00
F        20 (15)     10 (15)            1.67                   1.67
Total                                   5.05                   5.05
Calculated χ² = 5.05 + 5.05 = 10.10

Degrees of freedom = (5 − 1)(2 − 1) = 4, $\chi^2_{4,0.05}$ = 9.488
Decision: Since the χ² value (= 10.10) > $\chi^2_{4,0.05}$ (= 9.488), the null hypothesis H0 will not be accepted.
Conclusion: Based on the analysis of the available data, the claim of no difference in the two schools cannot be accepted. Therefore, the conclusion is that performance in the schools differs significantly.
The F Distribution
R.A. Fisher developed the F distribution in the early 1920s. The F distribution was a
transformation of the normal Z distribution which he had earlier developed.
The F statistic is used to test the equality of two variances.
Suppose the null hypothesis is H0: $\sigma_1^2 = \sigma_2^2$ and the alternative hypothesis is H1: $\sigma_1^2 > \sigma_2^2$, which is a one-tail test. If we are given α, we shall use the critical value $F_{v_1, v_2, \alpha}$, where v1 = n1 − 1 and v2 = n2 − 1.
Given the estimates $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$, then $F = \dfrac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2}$ where
$\hat{\sigma}_1^2 = \sum_{i=1}^{n_1} (x_i-\bar{x})^2 / (n_1-1)$
$\hat{\sigma}_2^2 = \sum_{i=1}^{n_2} (x_i-\bar{x})^2 / (n_2-1)$
Example
Suppose two different processes are used to manufacture light bulbs. The life of a bulb from process A is normally distributed with mean µa and standard deviation σa, and that from process B is also normally distributed with mean µb and standard deviation σb. The two processes can be compared by taking a sample from each.
Let the observations be:
Sample A            Sample B
n = 17              n = 21
$\bar{X}_a$ = 120hr         $\bar{X}_b$ = 1300hr
Sa = 60hr           Sb = 50hr
$F = \dfrac{\hat{\sigma}_a^2}{\hat{\sigma}_b^2} = \dfrac{n_a S_a^2}{n_a-1} \Big/ \dfrac{n_b S_b^2}{n_b-1} = \dfrac{17(60^2)}{16} \Big/ \dfrac{21(50^2)}{20} = \dfrac{3825}{2625} = \dfrac{51}{35} = 1.46$
i) F16, 20, 0.05 = 2.18
F < F16, 20, 0.05 ∴ H0 is not rejected, i.e. σa = σb
ii) On the other hand, if H0: $\sigma_a^2 = \sigma_b^2$ vs H1: $\sigma_a^2 \neq \sigma_b^2$, then we shall compare the calculated F with F16,20,0.025 (= 2.46), and since the F value (= 1.46) < 2.46, the null hypothesis is not rejected.
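The variance-ratio calculation in the light bulb example can be sketched as below. The function names are illustrative; following the worked example, each sample standard deviation is treated as computed with divisor n and converted to the n − 1 form before the ratio is taken. The critical value 2.18 is the table value quoted in the example.

```python
def unbiased_variance(s, n):
    """Convert a standard deviation computed with divisor n into the
    variance estimate with divisor n - 1, i.e. n * s^2 / (n - 1)."""
    return n * s ** 2 / (n - 1)

def f_ratio(s_a, n_a, s_b, n_b):
    """F statistic as the ratio of the two variance estimates."""
    return unbiased_variance(s_a, n_a) / unbiased_variance(s_b, n_b)

f_value = f_ratio(60, 17, 50, 21)
reject_h0 = f_value > 2.18          # table value F(16, 20) at alpha = 0.05
print(round(f_value, 2), reject_h0)
```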
LINEAR REGRESSION ANALYSIS

Linear Regression Analysis is one of the most frequently used techniques in research to find the relationship between two or more variables that are related. For example, we may want to know the association between height and weight, or the association between gestational age and the weight of a baby at birth.
Regression analysis involves determining whether variables are related, how closely they are related and what type of relationship exists, and hence obtaining an estimate of one variable based on knowledge of the other.
The Linear Regression Equation

Simple linear regression involves two variables X and Y. In determining the linear regression equation, it is already assumed that a linear relationship exists between the two variables. If X is the independent variable and Y is the dependent variable, then once we know X, Y will take a value depending on that value of X. The basic linear regression equation is given as
Y = a + bX
where $a = \bar{Y} - b\bar{X}$ and $b = \dfrac{\sum (X-\bar{X})(Y-\bar{Y})}{\sum (X-\bar{X})^2}$ or $b = \dfrac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$
The Derivation
From y = a + bx ….. (1)
Sum both sides of (1):
$\sum y = na + b\sum x$ ….. (2)
Also multiply both sides of (1) by x and sum: $\sum xy = a\sum x + b\sum x^2$ ….. (3)
From equation (2)
$\dfrac{\sum y}{n} = a + b\dfrac{\sum x}{n}$ → $\bar{y} = a + b\bar{x}$
∴ $a = \bar{y} - b\bar{x}$ ….. (4)
Substitute a in (4) into equation (3)
∴ $\sum xy = \left(\dfrac{\sum y}{n} - b\dfrac{\sum x}{n}\right)\sum x + b\sum x^2$
$= \dfrac{\sum x \sum y}{n} - \dfrac{b(\sum x)^2}{n} + b\sum x^2$
$\sum xy - \dfrac{\sum x \sum y}{n} = b\sum x^2 - \dfrac{b(\sum x)^2}{n}$
$n\sum xy - \sum x \sum y = nb\sum x^2 - b(\sum x)^2$
$n\sum xy - \sum x \sum y = b\left[n\sum x^2 - (\sum x)^2\right]$
so that $b = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$
Note that a linear regression can also be determined by plotting a scatter diagram of Y against X and using the free hand method to draw the line of best fit. However, the free hand method does not usually give accurate results.
Worked Examples
1. If it is believed that the weight of a person is related to his height, determine the
regression equation for a sample of 9 students whose heights and weights are as given
below:-
Height 1.3 1.4 1.4 1.5 1.5 1.5 1.6 1.6 1.7
Weight 130 135 145 132 150 168 165 180 170
Computation:-
Student     X       Y       X²      XY
1          1.3     130     1.69    169
2          1.4     135     1.96    189
3          1.4     145     1.96    203
4          1.5     132     2.25    198
5          1.5     150     2.25    225
6          1.5     168     2.25    252
7          1.6     165     2.56    264
8          1.6     180     2.56    288
9          1.7     170     2.89    289
Total     13.5    1375    20.37   2077
$\bar{X}$ = 1.5, $\bar{Y}$ = 152.78
$b = \dfrac{9(2077) - (13.5)(1375)}{9(20.37) - (13.5)^2} = \dfrac{130.5}{1.08} = 120.83$
a = 152.78 − 120.83 x 1.5 = −28.47
The linear regression equation is y = −28.47 + 120.83x
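The formulas for a and b derived above translate directly into code. The following sketch uses an illustrative function name and recomputes the height and weight example from the raw data.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for y = a + b*x using
    b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2) and a = ybar - b*xbar."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    return a, b

heights = [1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.6, 1.7]
weights = [130, 135, 145, 132, 150, 168, 165, 180, 170]
a, b = least_squares_line(heights, weights)
print(round(a, 2), round(b, 2))
```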
2. A small study involving 17 infants is conducted to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams. The data are shown in the table below.
S/No   Gestational Age (weeks) x   Birth Weight (grams) y
1 34.7 1895
2 36.0 2030
3 29.3 1440
4 40.1 2835
5 35.7 3090
6 42.4 3827
7 40.3 3260
8 37.3 2690
9 40.9 3285
10 38.3 2920
11 38.5 3430
12 41.4 3657
13 39.7 3685
14 39.7 3345
15 41.1 3260
16 38.0 2680
17 38.7 2005
Computation:-
S/No   Gestational Age (weeks) x   Birth Weight (grams) y   x²   y²   xy
1 34.7 1895 1204.09 3591025 65756.5
2 36.0 2030 1296 4120900 73080
3 29.3 1440 858.49 2073600 42192
4 40.1 2835 1608.01 8037225 113683.5
5 35.7 3090 1274.49 9548100 110313
6 42.4 3827 1797.76 14645929 162264.8
7 40.3 3260 1624.09 10627600 131378.0
8 37.3 2690 1391.29 7236100 100337
9 40.9 3285 1672.81 10791225 134356.5
10 38.3 2920 1466.89 8526400 111836
11 38.5 3430 1482.25 11764900 132005
12 41.4 3657 1713.96 13373649 151399.8
13 39.7 3685 1576.09 13579225 146294.5
14 39.7 3345 1576.09 11189025 132796.5
15 41.1 3260 1689.21 10627600 133986
16 38.0 2680 1444 7182400 101840
17 38.7 2005 1497.69 4020025 77593.5
Total 652.1 49334 25173.21 150934928 1921112.6
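$\bar{X} = \dfrac{652.1}{17} = 38.36$, $\bar{Y} = \dfrac{49334}{17} = 2902$
$b = \dfrac{17(1921112.6) - (652.1)(49334)}{17(25173.21) - (652.1)^2} = \dfrac{488212.8}{2710.16} = 180.14$
$a = \bar{Y} - b\bar{X} = 2902 - 180.14 \times 38.36 \approx -4008$
The regression equation is y = −4008 + 180.14x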
Correlation Analysis
Correlation deals with the closeness of variation between x and y. The variables are said to vary closely if changes in x and y are approximately proportional. Covariability is perfect when all the points fall on a straight line. Covariability is determined by
$\rho = \dfrac{Cov(x, y)}{\sqrt{var\,x}\,\sqrt{var\,y}} = \dfrac{Cov(x, y)}{\sigma_x \sigma_y}$ ….. (1)
Equation (1) is called the correlation coefficient formula.
For the standardized variables x/σx and y/σy, Var(x/σx ± y/σy) = 2 ± 2ρ, and a variance is ≥ 0 always.
Thus |ρ| ≤ 1, i.e. −1 ≤ ρ ≤ 1 are the values which ρ can attain.
Sample Correlation Coefficient
When the population is large, it is usually necessary to take a sample, and it is from the sample that $\hat{\rho}$, which is an estimate of ρ, is determined:
$\hat{\rho} = r = \dfrac{\dfrac{1}{n-1}\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\dfrac{1}{n-1}\sum (x-\bar{x})^2}\;\sqrt{\dfrac{1}{n-1}\sum (y-\bar{y})^2}} = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}$
r is the MLE of ρ and it is called the sample correlation coefficient.
r can be rewritten as $r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$
Worked Example:
In the example 2 on gestation period and birth weight given above, calculate the sample
correlation coefficient and draw appropriate conclusion on the association between gestation
period and infant birth weight.
Computation:
$r = \dfrac{17(1921112.6) - (652.1)(49334)}{\sqrt{\left[17(25173.21) - (652.1)^2\right]\left[17(150934928) - (49334)^2\right]}} = 0.82$
Conclusion: The value of r is positive and close to 1, which implies that there is a strong positive linear relationship between gestation period and birth weight. The longer the gestation period, the greater the baby's birth weight tends to be.
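The correlation coefficient in this worked example can be verified from the column totals of the computation table. The sketch below uses an illustrative function name and takes the totals exactly as tabulated (n = 17, Σx = 652.1, Σy = 49334, Σx² = 25173.21, Σy² = 150934928, Σxy = 1921112.6).

```python
import math

def correlation_from_sums(n, sx, sy, sxx, syy, sxy):
    """Sample correlation coefficient r from the column totals."""
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator

r = correlation_from_sums(n=17, sx=652.1, sy=49334,
                          sxx=25173.21, syy=150934928, sxy=1921112.6)
print(round(r, 2))
```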
Rank Correlation
There are cases where a dependency between two variables x and y can be observed but the distribution is not known. If this is so, then the methods of calculating r given above cannot be used.
Here Spearman's rank correlation is used: $r_s = 1 - \dfrac{6\sum d^2}{n(n^2-1)}$ where
d = difference between the ranks of x and y
n = number of pairs of observations
Spearman's rs is a non-parametric (or distribution-free) statistic.
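A minimal sketch of the rank correlation formula follows. The function names are illustrative, the data are hypothetical values made up for the illustration (they do not come from the notes), and the ranking helper assumes there are no tied values.

```python
def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rs(xs, ys):
    """r_s = 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid when there are no ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical scores for five subjects, made up for illustration.
x = [10, 20, 30, 40, 50]
y = [12, 25, 21, 48, 40]
rs = spearman_rs(x, y)
print(rs)
```

Here the rank differences are 0, −1, 1, −1 and 1, so Σd² = 4 and rs = 1 − 24/120.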
Multiple Regression Analysis
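Multiple regression analysis (M.R.A.) is used for testing hypotheses about the relationship between a dependent variable y and two or more independent variables, and for prediction.
For example, a three-variable regression model is written as $y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + u_i$. In addition to the assumptions that hold for the simple regression model, it is assumed that there is no exact linear relationship between the x values.
The OLS estimates are obtained by minimizing the sum of squared residuals:
$\sum e_i^2 = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - \hat{b}_0 - \hat{b}_1 x_{1i} - \hat{b}_2 x_{2i})^2$
As in simple linear regression, we can then obtain:
$\hat{b}_1 = \dfrac{(\sum x_1 y)(\sum x_2^2) - (\sum x_2 y)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}$
$\hat{b}_2 = \dfrac{(\sum x_2 y)(\sum x_1^2) - (\sum x_1 y)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}$
$\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X}_1 - \hat{b}_2 \bar{X}_2$
where, in the formulas for $\hat{b}_1$ and $\hat{b}_2$, the variables are measured as deviations from their means:
$x_1 = X_1 - \bar{X}_1$, $x_2 = X_2 - \bar{X}_2$ and $y = Y - \bar{Y}$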
Worked Example during the lecture in class.
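Since the worked example is not reproduced in these notes, the following sketch illustrates the two-regressor formulas on hypothetical data. The function name and the data are assumptions made for the illustration: the y values are generated exactly from y = 1 + 2x1 + 3x2, so the estimates should recover those coefficients.

```python
def two_regressor_ols(x1, x2, y):
    """OLS estimates (b0, b1, b2) for y = b0 + b1*x1 + b2*x2,
    computed from deviations about the means."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 ** 2      # zero if x1 and x2 are exactly collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 with no error term.
x1 = [1, 2, 3, 4]
x2 = [1, 0, 2, 1]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
b0, b1, b2 = two_regressor_ols(x1, x2, y)
print(b0, b1, b2)
```

The denominator `det` is the quantity that becomes zero when there is an exact linear relationship between the x values, which is why that assumption is needed.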
