Inferential Statistics

06/17/2024 RAM KRISHNA TAMANG 1

Determination of sample size
In sampling analysis, the most difficult question is “what should be the size of the sample, or how large should the sample be?”
If the sample size is too small, it may not serve to achieve the objectives of the study. If it is too large, it represents the population more precisely but wastes resources, with a huge cost and more time needed for gathering the information.
As a general rule, the sample must be of an optimum size, i.e. it should be neither excessively large nor too small.
There are two alternative approaches for determining the size of the sample:
• Determination of sample size by using mean
• Determination of sample size by using proportion

1. Determination of sample size by using mean
Let $\bar{x}$ be the mean of a random sample of size n drawn from a population having mean µ and standard deviation σ. Then the required minimum sample size is given by
$n_0 = \dfrac{Z_{\alpha/2}^2\,\sigma^2}{d^2}$ if population size is unknown
$n = \dfrac{n_0}{1 + \dfrac{n_0}{N}}$ if population size is known
Where,
n = sample size
$d = |\bar{x} - \mu|$ = acceptable error or estimated error or margin of error or desired error
σ = population standard deviation
$Z_{\alpha/2}$ = standard normal variate or critical value or Z value at α level of significance (two tailed)
N = population size.

Example:
The mean systolic blood pressure of a certain group of people was found to be 125 mm of Hg with a standard deviation of 15 mm of Hg. Calculate the sample size needed to verify the result at 5% level of significance if the error is not to exceed 2 mm of Hg. Also find the sample size if the sample is selected from a population of size 50000.
Solution: Given,
Mean of systolic blood pressure (µ) = 125 mm of Hg
Standard deviation of systolic blood pressure (σ) = 15 mm of Hg
Level of significance (α) = 5%, so $Z_{\alpha/2}$ = 1.96
Desired error (d) = 2 mm of Hg
Population size (N) = 50000
Sample size (n) = ?
We have,
$n_0 = \dfrac{Z_{\alpha/2}^2\,\sigma^2}{d^2} = \dfrac{(1.96)^2 (15)^2}{(2)^2} = \dfrac{864.36}{4} = 216.09 \approx 216$
If population size is known then the required minimum sample size is
$n = \dfrac{n_0}{1 + \dfrac{n_0}{N}} = \dfrac{216}{1 + \dfrac{216}{50000}} = 215.07 \approx 215$
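The two sample size formulas for the mean can be checked with a short Python sketch. The function names are illustrative only (they are not part of the slides), and the Z value 1.96 is typed in from the normal table as in the example.

```python
def sample_size_mean(z, sigma, d):
    """Minimum sample size n0 = (z * sigma / d)^2 when N is unknown."""
    return (z * sigma / d) ** 2


def finite_population_size(n0, N):
    """Adjusted sample size n = n0 / (1 + n0 / N) when N is known."""
    return n0 / (1 + n0 / N)


# Blood pressure example: sigma = 15, d = 2, z = 1.96 (5% level, two tailed)
n0 = sample_size_mean(1.96, 15, 2)
# The worked example rounds n0 to 216 before applying the correction
n = finite_population_size(round(n0), 50000)
print(round(n0, 2), round(n, 2))
```

The worked example rounds to the nearest whole number; in practice the result is often rounded up instead, so that the error bound is not exceeded.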

2. Determination of sample size by using proportion
Let p be the proportion in a sample of size n drawn from a population having proportion P of the same characteristic. Then the required sample size is given by
$n_0 = \dfrac{Z_{\alpha/2}^2\,PQ}{d^2}$ if population size is unknown
$n = \dfrac{n_0}{1 + \dfrac{n_0}{N}}$ if population size is known
Where,
n = sample size
Q = 1 – P
$d = |p - P|$ = acceptable error or estimated error or margin of error or desired error
$Z_{\alpha/2}$ = standard normal variate or critical value or Z value at α level of significance (two tailed)
N = population size.

Example:
A survey estimated that 30% of Americans aged 16 to 20 drove under the influence of drugs or alcohol. A similar survey is planned for Nepal; a researcher wants a 95% confidence interval to have a margin of error of 0.04. How many adolescents should be included in the study? If the population size is 10000, how many adolescents need to be included in the study?
Solution: Given,
The proportion who drive under the influence of drugs or alcohol (P) = 30% = 0.3
And Q = 1 – P = 1 – 0.3 = 0.7
Margin of error (d) = 0.04
Confidence level (1 – α) = 0.95
Level of significance (α) = 0.05, so $Z_{\alpha/2}$ = 1.96
Population size (N) = 10000
Required sample size (n) = ?
We know, the required minimum sample size is given by
$n_0 = \dfrac{Z_{\alpha/2}^2\,PQ}{d^2} = \dfrac{(1.96)^2 (0.3)(0.7)}{(0.04)^2} = \dfrac{0.806736}{0.0016} = 504.21 \approx 504$
And if the total population size is 10000 then the minimum required sample size is
$n = \dfrac{n_0}{1 + \dfrac{n_0}{N}} = \dfrac{504}{1 + \dfrac{504}{10000}} = 479.82 \approx 480$
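The proportion-based formulas follow the same pattern. The sketch below uses hypothetical function names and takes the Z value from the normal table, as the worked example does.

```python
def sample_size_proportion(z, P, d):
    """Minimum sample size n0 = z^2 * P * Q / d^2 when N is unknown."""
    Q = 1 - P
    return z ** 2 * P * Q / d ** 2


def finite_population_size(n0, N):
    """Adjusted sample size n = n0 / (1 + n0 / N) when N is known."""
    return n0 / (1 + n0 / N)


# Survey example: P = 0.3, d = 0.04, z = 1.96 (95% confidence)
n0 = sample_size_proportion(1.96, 0.3, 0.04)
# The worked example rounds n0 to 504 before applying the correction
n = finite_population_size(round(n0), 10000)
print(round(n0, 2), round(n, 2))
```

Because the product PQ is largest at P = 0.5, taking P = 0.5 gives the most conservative (largest) sample size when no prior estimate of the proportion is available.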

Example:
A cigarette manufacturer wishes to conduct a survey using a random sample to estimate the proportion of smokers who would switch to the company's newly developed low tar brand. The sampling error should not be more than 0.02 above or below the actual proportion with 99% degree of confidence. Determine the optimum sample size.
Solution:
Proportion of smokers (P) = 50% = 0.5 (no prior estimate is available, so P = 0.5 is taken)
And Q = 1 – P = 1 – 0.5 = 0.5
Sampling error (d) = 0.02
Confidence level (1 – α) = 0.99
Level of significance (α) = 0.01, so $Z_{\alpha/2}$ = 2.576
Sample size (n) = ?
We know,
Optimum sample size
$n = \dfrac{Z_{\alpha/2}^2\,PQ}{d^2} = \dfrac{(2.576)^2 (0.5)(0.5)}{(0.02)^2} = \dfrac{1.658944}{0.0004} = 4147.36 \approx 4147$

1. The mean pulse rate of a population is believed to be 72 per minute with a standard deviation of 10 beats. Calculate the minimum sample size to verify this if the allowable error is ±2 beats at 95% confidence level.
2. A vitamin manufacturing concern wants to estimate the average amount of purchase of its product in a month by the patients. If the standard deviation is Rs.10, find the sample size if the maximum error is not to exceed Rs.3 with a probability of 0.99.
3. Hookworm prevalence rate was 30% before the specific treatment and adoption of other measures. Calculate the size of sample required to find the prevalence rate now with an allowable error of 10% and 20%.
4. A survey is planned to determine what proportion of higher secondary students have abused drugs. Prevalence is not available from previous studies, the confidence level is 0.95 and the allowable error is 0.04. Determine the appropriate sample size. What sample size would be required if a 99% confidence interval were desired?

Estimation
The principle of sampling theory is used in statistical inference. Statistical inference is the procedure by which we reach a conclusion about a population on the basis of the information contained in a sample drawn from that population. The statistical technique of estimating an unknown population parameter from a sample statistic is known as estimation. There are two types of estimation:
• Point estimation
• Interval estimation
• Point estimation
If a single value is used to estimate the unknown population parameter from the information of the sample, the procedure is referred to as point estimation. Thus point estimation provides a single estimated value of the parameter under consideration, e.g. the sample mean ($\bar{x}$) is used for estimating the population mean (µ) and is also known as an estimator of the population mean. Similarly, the sample proportion (p) is the point estimator of the population proportion (P).
• Interval estimation
An interval estimate consists of two numerical values defining a range, also called an interval, constructed around the point estimate. In other words, if the estimate of a population parameter is given by two distinct numbers between which the parameter may be considered to lie, then that statistical technique of estimation is known as interval estimation.
The criteria for a good estimator are
• Unbiasedness
• Efficiency
• Consistency
• Sufficiency

• Unbiasedness
An estimator is said to be unbiased if the expected value of the sample statistic is equal to the corresponding population parameter. Let t be a sample statistic, θ be the respective population parameter and E(t) be the expected value of t; then t is said to be unbiased if E(t) = θ.
The sample mean ($\bar{x}$) is an unbiased estimator of the population mean µ, i.e. E($\bar{x}$) = µ. Similarly, the sample proportion p is an unbiased estimator of the population proportion P, i.e. E(p) = P.
But the sample variance $s^2$ is not an unbiased estimator of the population variance $\sigma^2$, whereas $S^2$ is an unbiased estimator of the population variance.
Where, $s^2$ = sample variance = $\dfrac{1}{n}\sum (x - \bar{x})^2$
And $S^2$ = unbiased estimator of population variance = $\dfrac{1}{n-1}\sum (x - \bar{x})^2$.
• Efficiency
Efficiency refers to the sampling variability of a sample estimate. If two estimators of a given population parameter are both unbiased, the one with the smaller variance for a given sample size is defined as being relatively more efficient. Let t1 and t2 be two unbiased estimators of a population parameter; then t1 is relatively more efficient than t2 if var(t1) < var(t2).
• Consistency
An estimator is said to be consistent if its value comes nearer to the population parameter as the sample size becomes sufficiently large, or if the probability of the estimator being close to the parameter approaches 1 as n approaches infinity, i.e. $t \to \theta$ as $n \to \infty$, or $\lim_{n \to \infty} P(|t - \theta| < \varepsilon) = 1$ for every $\varepsilon > 0$.
• Sufficiency
An estimator t is said to be sufficient if it uses all the information contained in the sample to estimate the population parameter. The sample mean and sample proportion are sufficient estimators of the population mean and population proportion respectively.
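The statement that the divisor-n sample variance is biased while the divisor-(n – 1) version is unbiased can be verified exactly on a toy case. The sketch below assumes a hypothetical three-value population and lists every sample of size 2 drawn with replacement, so no random simulation is needed.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [1.0, 2.0, 3.0]        # hypothetical toy population
sigma2 = pvariance(population)      # population variance = 2/3

# Every possible sample of size 2 drawn with replacement (9 samples)
samples = list(product(population, repeat=2))

# Average of each estimator over all equally likely samples = its expected value
expected_s2 = mean(pvariance(s) for s in samples)  # divisor n
expected_S2 = mean(variance(s) for s in samples)   # divisor n - 1

print(sigma2, expected_s2, expected_S2)
```

Here the expected value of the divisor-n variance is (n – 1)/n times the population variance, while the divisor-(n – 1) variance averages to the population variance itself.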

Confidence interval estimates of population
parameters
In interval estimation of the population parameter θ, suppose we can find two quantities t1 and t2, based on the sample observations drawn from the population, such that the unknown population parameter is included in the interval [t1, t2]. Then such an interval, used to estimate the unknown population parameter, is known as the confidence interval for the population parameter. Its end points are known as confidence limits: t1 is called the lower confidence limit and t2 is called the upper confidence limit.
The confidence interval has a specified confidence or probability of correctly estimating the true value of the population parameter,
i.e. $P(t_1 \le \theta \le t_2) = 1 - \alpha$
• Where 1 – α is the confidence level or confidence coefficient. The confidence level refers to the probability that an interval constructed from a random sample contains the true value of the population parameter, e.g. a 95% confidence level indicates that there is a 95% probability that an interval constructed in this way contains the parameter, with the remaining 5% being the risk that the parameter lies outside the confidence limits.

Confidence interval estimate of population
mean from large sample size
Let $\bar{x}$ be the sample mean of a large sample of size n drawn from a normally distributed population having mean µ and standard deviation σ. It is also assumed that the sample mean is normally distributed for large sample size. Then the interval estimate of the population mean µ by using the sample mean is given by
Confidence interval = $[\bar{x} \pm Z_{\alpha/2}\,\mathrm{S.E.}(\bar{x})]$
= $[\bar{x} \pm Z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}]$ if population size is unknown and samples are drawn with replacement
= $[\bar{x} \pm Z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}]$ if population size is known and samples are drawn without replacement
Where,
$Z_{\alpha/2}$ = critical/tabulated value at α level of significance
S.E.($\bar{x}$) = standard error of sample mean
σ = population standard deviation
n = sample size
N = population size
Note: if the population standard deviation is unknown then we estimate σ by its sample unbiased estimate, i.e. $\hat{\sigma}$ = S.

Example:
As a part of the National Health and Nutrition Examination Survey (NHANES), haemoglobin level was checked for a sample of 1140 men aged 70 or above. The sample mean of haemoglobin was 145.3 per litre and the standard deviation was 12.87 per litre. Use the data to construct 90%, 95% and 99% confidence intervals for the population mean.
Solution: Given,
Sample size (n) = 1140
Sample mean of haemoglobin ($\bar{x}$) = 145.3 per litre
Sample standard deviation (s) = 12.87 per litre
At 90% confidence level
Confidence level (1 – α) = 0.9
Level of significance (α) = 0.1
Then the 90% confidence interval for population mean of haemoglobin level is given by
= $[\bar{x} \pm Z_{\alpha/2}\dfrac{s}{\sqrt{n}}]$
= $[145.3 \pm 1.645 \times \dfrac{12.87}{\sqrt{1140}}]$
= [145.3 ± 0.627]
= [145.3 – 0.627, 145.3 + 0.627]
= [144.67, 145.93]
Hence the population mean of haemoglobin level lies between 144.67 per litre and 145.93 per litre at 90% confidence level.

Similarly, at 95% confidence level
Confidence level (1 – α) = 0.95
Level of significance (α) = 0.05
Then the 95% confidence interval for population mean of haemoglobin level is given by
= $[\bar{x} \pm Z_{\alpha/2}\dfrac{s}{\sqrt{n}}]$
= $[145.3 \pm 1.96 \times \dfrac{12.87}{\sqrt{1140}}]$
= [145.3 ± 0.747]
= [145.3 – 0.747, 145.3 + 0.747]
= [144.55, 146.05]
Hence the population mean of haemoglobin level lies between 144.55 per litre and 146.05 per litre at 95% confidence level.

And, at 99% confidence level
Confidence level (1 – α) = 0.99
Level of significance (α) = 0.01
Then the 99% confidence interval for population mean of haemoglobin level is given by
= $[\bar{x} \pm Z_{\alpha/2}\dfrac{s}{\sqrt{n}}]$
= $[145.3 \pm 2.576 \times \dfrac{12.87}{\sqrt{1140}}]$
= [145.3 ± 0.982]
= [145.3 – 0.982, 145.3 + 0.982]
= [144.32, 146.28]
Hence the population mean of haemoglobin level lies between 144.32 per litre and 146.28 per litre at 99% confidence level.

Confidence interval estimate of population mean from small sample size
Let $\bar{x}$ be the mean of a sample of size n (n ≤ 30) selected from a normally distributed population having mean µ and unknown variance. Then the confidence limits of the population mean for small sample size are given by
Confidence interval = $[\bar{x} \pm t_{\alpha/2,\,n-1}\,\mathrm{S.E.}(\bar{x})]$
= $[\bar{x} \pm t_{\alpha/2,\,n-1}\dfrac{S}{\sqrt{n}}]$
= $[\bar{x} \pm t_{\alpha/2,\,n-1}\dfrac{s}{\sqrt{n-1}}]$
Where,
S.E.($\bar{x}$) = $\dfrac{S}{\sqrt{n}} = \dfrac{s}{\sqrt{n-1}}$
$t_{\alpha/2,\,n-1}$ = critical/tabulated value of t-distribution at α level of significance (two tailed) and n – 1 degrees of freedom
s = sample standard deviation = $\sqrt{\dfrac{\sum (x-\bar{x})^2}{n}} = \sqrt{\dfrac{\sum x^2}{n} - \bar{x}^2}$
S = unbiased estimator of population standard deviation = $\sqrt{\dfrac{\sum (x-\bar{x})^2}{n-1}} = \sqrt{\dfrac{\sum x^2 - n\bar{x}^2}{n-1}}$
n – 1 = degrees of freedom
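Both interval formulas for the mean have the form estimate ± critical value × standard error. A minimal Python sketch of the two cases follows; the function names are assumptions made for illustration, the Z value is computed with the standard library's `NormalDist`, and the t value must be supplied from a t table because the standard library has no t-distribution.

```python
import math
from statistics import NormalDist


def z_interval(mean, sd, n, confidence, N=None):
    """Large-sample interval: mean ± z * sd / sqrt(n).

    If the population size N is given, the finite population
    correction sqrt((N - n) / (N - 1)) is applied to the standard error.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sd / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return mean - z * se, mean + z * se


def t_interval(mean, s, n, t_crit):
    """Small-sample interval: mean ± t * s / sqrt(n - 1).

    s is the sample standard deviation with divisor n;
    t_crit is read from a t table for n - 1 degrees of freedom.
    """
    margin = t_crit * s / math.sqrt(n - 1)
    return mean - margin, mean + margin


# Haemoglobin data: n = 1140, mean = 145.3, s = 12.87
for level in (0.90, 0.95, 0.99):
    lower, upper = z_interval(145.3, 12.87, 1140, level)
    print(level, round(lower, 2), round(upper, 2))
```

The computed Z values (for example 1.959964 rather than 1.96) differ slightly from the table values, so hand calculations may differ in the last decimal place.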

Example
• A random sample of 12 records gave an average length of 164 mm with a standard deviation of 4 mm. Find the 90% and 95% confidence limits for the population mean.
Solution: Given,
Sample size (n) = 12
Sample average length ($\bar{x}$) = 164 mm
Sample standard deviation (s) = 4 mm
At 90% confidence level
Confidence level (1 – α) = 0.9
Level of significance (α) = 0.1
Then the 90% confidence interval for average length of population is given by
= $[\bar{x} \pm t_{\alpha/2,\,n-1}\dfrac{s}{\sqrt{n-1}}]$
= $[164 \pm t_{0.05,\,11} \times \dfrac{4}{\sqrt{12-1}}]$
= [164 ± 1.796 × 1.206]
= [164 ± 2.166]
= [164 – 2.166, 164 + 2.166]
= [161.83, 166.17]
Hence the average length of the population lies between 161.83 mm and 166.17 mm at 90% confidence level.

And at 95% confidence level
Confidence level (1 – α) = 0.95
Level of significance (α) = 0.05
Then the 95% confidence interval for average length of population is given by
= $[\bar{x} \pm t_{\alpha/2,\,n-1}\dfrac{s}{\sqrt{n-1}}]$
= $[164 \pm t_{0.025,\,11} \times \dfrac{4}{\sqrt{12-1}}]$
= [164 ± 2.201 × 1.206]
= [164 ± 2.65]
= [164 – 2.65, 164 + 2.65]
= [161.35, 166.65]
Hence the average length of the population lies between 161.35 mm and 166.65 mm at 95% confidence level.
Example
A random sample of 10 professors had the following levels of glucose (F): 70, 120, 110, 101, 83, 88, 95, 98, 107 and 100. Find the 95% confidence limits for the mean glucose level of the population of such professors.
Solution:
Glucose (X)      X²
70               4900
120              14400
110              12100
101              10201
83               6889
88               7744
95               9025
98               9604
107              11449
100              10000
ΣX = 972         ΣX² = 96312

Sample mean $\bar{x} = \dfrac{\sum X}{n} = \dfrac{972}{10} = 97.2$
And $S = \sqrt{\dfrac{\sum X^2 - n\bar{x}^2}{n-1}} = \sqrt{\dfrac{96312 - 10(97.2)^2}{9}} = \sqrt{203.73} = 14.27$
At 95% confidence level
Level of significance (α) = 0.05
Then the 95% confidence limits for population mean of glucose level are given by
= $[\bar{x} \pm t_{\alpha/2,\,n-1}\dfrac{S}{\sqrt{n}}]$
= $[97.2 \pm t_{0.025,\,9} \times \dfrac{14.27}{\sqrt{10}}]$
= [97.2 ± 2.262 × 4.513]
= [97.2 ± 10.21]
= [97.2 – 10.21, 97.2 + 10.21]
= [86.99, 107.41]
Hence the mean glucose level of the population lies between 86.99 F and 107.41 F at 95% confidence level.
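The glucose calculation can be reproduced from the raw data with Python's standard library. This is a sketch only: `statistics.stdev` uses the divisor n – 1 and so corresponds to S, and the t value 2.262 (9 degrees of freedom, 95% two tailed) is typed in by hand from a t table.

```python
import math
from statistics import mean, stdev

glucose = [70, 120, 110, 101, 83, 88, 95, 98, 107, 100]

n = len(glucose)
x_bar = mean(glucose)        # sample mean
S = stdev(glucose)           # unbiased estimate, divisor n - 1
t_crit = 2.262               # table value for 9 d.f. at 95% (assumed input)

margin = t_crit * S / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(round(x_bar, 1), round(S, 2), round(lower, 2), round(upper, 2))
```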

Confidence interval estimate for population
proportion from sample proportion
Let p be the sample proportion of success (or of a certain characteristic) in a sample of size n randomly drawn from a population having corresponding population proportion P; then the confidence interval is given by
Confidence interval = $[p \pm Z_{\alpha/2}\,\mathrm{S.E.}(p)]$
= $[p \pm Z_{\alpha/2}\sqrt{\dfrac{PQ}{n}}]$ if population size is unknown and samples are drawn with replacement
= $[p \pm Z_{\alpha/2}\sqrt{\dfrac{PQ}{n}}\sqrt{\dfrac{N-n}{N-1}}]$ if population size is known and samples are drawn without replacement
Where,
P = population proportion of success
Q = 1 – P
p = sample proportion of success
n = sample size
N = population size
Note: if the population proportion is unknown then we estimate P from the sample proportion, i.e. $\hat{P}$ = p.

Example: In a sample of size 36, the proportion of smokers is 15%. Determine the 90% and 95% confidence limits within which the population proportion must lie.
Solution: Given,
Sample size (n) = 36
Sample proportion of smokers (p) = 15% = 0.15
And q = 1 – p = 1 – 0.15 = 0.85
At 95% confidence level
Confidence level (1 – α) = 0.95
Level of significance (α) = 0.05
Then the 95% confidence interval for population proportion of smokers is given by
= $[p \pm Z_{\alpha/2}\sqrt{\dfrac{pq}{n}}]$
= $[0.15 \pm 1.96\sqrt{\dfrac{0.15 \times 0.85}{36}}]$
= [0.15 ± 1.96 × 0.0595]
= [0.15 ± 0.117]
= [0.15 – 0.117, 0.15 + 0.117]
= [0.033, 0.267]
Hence the confidence limits of the population proportion of smokers are 0.033 and 0.267 at 95% confidence level.
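The proportion interval can be sketched in the same way. The function name is hypothetical, P is estimated by the sample proportion p as the note describes, and the Z value is passed in from the normal table.

```python
import math


def proportion_interval(p, n, z, N=None):
    """Interval p ± z * sqrt(p * q / n), with P estimated by p.

    If the population size N is given, the finite population
    correction sqrt((N - n) / (N - 1)) is applied.
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return p - z * se, p + z * se


# Smoker example: n = 36, p = 0.15, z = 1.96 (95% confidence)
lower, upper = proportion_interval(0.15, 36, 1.96)
print(round(lower, 3), round(upper, 3))
```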

Hypothesis testing
• In inferential statistics, the population means the entire mass of data, whose characteristics are generally unknown.
• The testing of hypothesis is one of the important topics of inferential statistics that enables us to test the validity of some claim by using sample data.
• A hypothesis is defined by Webster as “a tentative theory or supposition provisionally adopted to explain certain fact and to guide in the investigation of others”.
• Thus, a hypothesis means a pre-assumption about the population parameter which needs to be tested in order to ascertain whether it is true or false.
There are two types of hypothesis involved in testing a hypothesis, which are
• Null hypothesis ($H_0$)
• Alternative hypothesis ($H_1$)
1. Null hypothesis ($H_0$)
• The supposition about the population parameter is known as the null hypothesis.
• Null hypothesis means no difference or zero difference between the true population parameter and the statistic computed from the selected sample data. A hypothesis means an assumption; therefore null hypothesis means a pre-assumption of no difference.
• According to Prof. R.A. Fisher, “null hypothesis is the hypothesis which is tested for the possible rejection under the assumption that it is true”. It is denoted by $H_0$ and set up as follows (e.g. z-test)
• Null hypothesis: $H_0: \mu = \mu_0$ i.e. there is no significant difference between the sample mean and population mean, or the samples are selected from a normally distributed population having mean $\mu_0$.
• e.g. Null hypothesis: $H_0: \mu$ = 120 mm of Hg i.e. the samples are drawn from the population of males with mean systolic blood pressure 120 mm of Hg.

2. Alternative hypothesis ($H_1$)
• If the decision maker or researcher rejects the null hypothesis on the basis of sample information, another hypothesis is accepted which is complementary to the null hypothesis.
• It is known as the alternative hypothesis; the alternative hypothesis is simply the opposite of the null hypothesis.
• The alternative hypothesis is denoted by $H_1$ and set up as follows for the above example
• Alternative hypothesis: $H_1: \mu \ne \mu_0$ (for two tailed test) i.e. there is a significant difference between the sample mean and population mean, or the samples are selected from a normally distributed population having mean not equal to $\mu_0$.
Or,
• Alternative hypothesis: $H_1: \mu > \mu_0$ (for right tailed test) i.e. the sample mean is significantly greater than the population mean, or the samples are selected from a normally distributed population having mean significantly greater than $\mu_0$.
Or,
• Alternative hypothesis: $H_1: \mu < \mu_0$ (for left tailed test) i.e. the sample mean is significantly smaller than the population mean, or the samples are selected from a normally distributed population having mean significantly smaller than $\mu_0$.
Error in hypothesis testing
• The testing of hypothesis is a statistical procedure for deciding whether to reject or accept the null hypothesis ($H_0$) on the basis of information contained in the sample data drawn from the population.
• Sample data always lack complete information about the entire population.
• In such a situation, the decision made by a researcher may be wrong. Such wrong decisions are known as errors in hypothesis testing. In testing of hypothesis there are two types of errors:
• Type I error: rejection of a true null hypothesis
• Type II error: acceptance of a false null hypothesis

Decision \ State of nature | Null hypothesis is true | Null hypothesis is false (alternative is true)
Accept null hypothesis | Correct decision (1 – α) | Wrong decision (Type II error, β)
Reject null hypothesis | Wrong decision (Type I error, α) | Correct decision (1 – β)

Type I error (α)
• It is the error of rejecting the null hypothesis when the null hypothesis is true.
• In other words, the rejection of a true hypothesis is known as type I error.
• The probability of type I error is denoted by α and given by
α = probability (type I error)
= probability (rejection of null hypothesis when it is true)
= P(reject $H_0$ | $H_0$ is true)
It is also known as level of significance.
Type II error (β)
• It is the error of accepting the null hypothesis when the null hypothesis is false.
• In other words, acceptance of a false hypothesis is known as type II error.
• The probability of type II error is denoted by β and given by
β = probability (type II error)
= probability (acceptance of null hypothesis when it is false)
= P(accept $H_0$ | $H_0$ is false)

Level of significance (α)
• The probability of rejecting the null hypothesis when the null hypothesis is true is known as level of significance.
• In other words, the size of the rejection region under a true null hypothesis is known as level of significance.
• It is denoted by α and defined as
α = level of significance
= probability (rejection of null hypothesis when it is true)
= P(reject $H_0$ | $H_0$ is true)
It is also known as probability of type I error.
P-value
• P-value is the smallest level of significance at which the null hypothesis ($H_0$) would be rejected when a specified test procedure is used on a given set of data.
• So, the p-value is the probability that the test statistic is at least as extreme as the value of the test statistic calculated from the observed set of data, assuming the null hypothesis is true.
• In other words, for a right tailed test the p-value is the probability of a value greater than or equal to the observed test statistic, i.e. the area in the right tail.
• Similarly, for a left tailed test it is the probability of a value less than or equal to the observed test statistic, i.e. the area in the left tail.
• If the p-value is less than the level of significance (α) then we reject the null hypothesis ($H_0$).
• It is simply a measure of how likely the data were to have occurred by chance, assuming the null hypothesis is true.
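For a Z statistic, the tail areas described above can be computed from the standard normal distribution function. The following is a minimal sketch; the function name and the observed value –2.12 are illustrative assumptions.

```python
from statistics import NormalDist


def p_value(z, tail="two"):
    """p-value of an observed Z statistic under the null hypothesis.

    tail = "right": area to the right of z
    tail = "left":  area to the left of z
    tail = "two":   area in both tails beyond |z|
    """
    phi = NormalDist().cdf
    if tail == "right":
        return 1 - phi(z)
    if tail == "left":
        return phi(z)
    return 2 * (1 - phi(abs(z)))


alpha = 0.05
p = p_value(-2.12, "two")   # hypothetical observed statistic
print(round(p, 3), "reject H0" if p < alpha else "do not reject H0")
```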

Critical region
• It is also called the rejection region. The set of all possible values of the test statistic is divided into two regions, one leading to the rejection of the null hypothesis ($H_0$) and the other to the acceptance of the null hypothesis ($H_0$), based on the level of significance (α) and the alternative hypothesis ($H_1$).
• The region which leads to the rejection of the null hypothesis ($H_0$) is known as the rejection region, while the region which leads to the acceptance of the null hypothesis ($H_0$) is known as the acceptance region.
• If the test statistic falls into the rejection region, the null hypothesis is rejected. If the test statistic falls into the acceptance region, then the null hypothesis ($H_0$) is accepted.
[Figure: normal curves showing the acceptance region, with area 1 – α]

Critical value
• The value of the test statistic which separates the critical region and the acceptance region is called the critical value. It is also known as the significant value or tabulated value.
• The critical value is based on the level of significance (α) and the alternative hypothesis ($H_1$).
• The critical value of Z for a one tailed test at level of significance α is the same as the critical value of Z for a two tailed test at level of significance 2α.
For right tailed test, $P(Z > Z_{\alpha}) = \alpha$
For left tailed test, $P(Z < -Z_{\alpha}) = \alpha$
For two tailed test, $P(|Z| > Z_{\alpha/2}) = \alpha$
Or, $P(Z > Z_{\alpha/2}) + P(Z < -Z_{\alpha/2}) = \alpha$
Or, $P(Z > Z_{\alpha/2}) + P(Z > Z_{\alpha/2}) = \alpha$
Or, $P(Z > Z_{\alpha/2}) = \dfrac{\alpha}{2}$
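The tabulated critical values of Z can be reproduced with the inverse of the standard normal distribution function. A small sketch, with an illustrative function name:

```python
from statistics import NormalDist


def z_critical(alpha, tail="two"):
    """Critical value of Z at level of significance alpha.

    right tailed: value with area alpha to its right
    left tailed:  value with area alpha to its left (negative)
    two tailed:   value with area alpha / 2 in each tail
    """
    inv = NormalDist().inv_cdf
    if tail == "right":
        return inv(1 - alpha)
    if tail == "left":
        return inv(alpha)
    return inv(1 - alpha / 2)


print(round(z_critical(0.05, "two"), 3))     # two tailed, 5%
print(round(z_critical(0.05, "right"), 3))   # right tailed, 5%
print(round(z_critical(0.01, "right"), 3))   # right tailed, 1%
```

The one tailed value at α equals the two tailed value at 2α, e.g. `z_critical(0.05, "right")` and `z_critical(0.10, "two")` give the same number.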

One tailed test and two tailed test
• Whether a test is one tailed or two tailed depends upon the location of the critical region on the tails of the standard normal curve, which is symmetrical about the mean and has total area equal to unity.
• If an alternative hypothesis leads to two alternatives to the null hypothesis, it is known as a two tailed test, as the critical region is situated on both tails.
• In other words, if the direction of difference is not given in the statement of hypothesis then it is said to be a two tailed test.
• If an alternative hypothesis leads to only one alternative to the null hypothesis then it is known as a one tailed test, as the critical region is situated on only one tail, i.e. either the left tail or the right tail of the normal curve.
• In other words, if the direction of difference is given in the statement of hypothesis then it is known as a one tailed test.
General procedure of testing of hypothesis
Step 1: set up null hypothesis
Step 2: set up alternative hypothesis
Step 3: selecting the level of significance
Step 4: identify the sample statistic to be used and its sampling distribution
Step 5: compute test statistic based on sampling distribution
Step 6: obtain the critical value from the appropriate table
Step 7: draw the conclusion
Critical value approach
• If the calculated value of the test statistic is greater than the tabulated value then we reject the null hypothesis ($H_0$), and
• If the calculated value of the test statistic is less than or equal to the tabulated value then we accept the null hypothesis ($H_0$).
P-value approach
• If p-value < level of significance, then we reject the null hypothesis ($H_0$), and
• If p-value ≥ level of significance, then we accept the null hypothesis ($H_0$).
Parametric tests
• A parametric statistical test is a test whose model
specifies certain conditions about the parameters of
the population from which the samples are drawn.
• Sample statistics are used to test the hypothesis made about certain parameters of the population (universe). The nature of the population distribution from which the sample is drawn is known.
• Z-test, t-test, F-test and ANOVA are examples of parametric tests.

Z-test
(Test the significance of large sample size)
• It is an important parametric test based upon the assumption of normal distribution. When the samples are selected from a population of known parameters with sample size more than 30, the Z-test is used.
• If the sample size is more than 30, the sampling distribution of the statistic is approximately normal even when the sample is selected from a non-normal population.
• The Z-test statistic is defined as the ratio of the difference between the sample statistic and the expected value of the sample statistic to the standard error of the sample statistic.
$Z = \dfrac{t - E(t)}{\mathrm{S.E.}(t)} \sim N(0, 1)$
Where t = statistic
E(t) = expected value of the statistic
S.E.(t) = standard error of the statistic
Assumptions of Z-test
• The sample size should be large, i.e. n > 30.
• The population from which the samples are drawn is normally distributed.
• The samples are drawn randomly from the population.
• The sample observations are independent of each other.
• The population standard deviation σ is known.
Z-tests are used to test
• The significance of single sample mean
• The significance difference between two sample means
• The significance of single sample proportion
• The significance difference between two sample proportions
1. Test the significance of single sample mean
Let us consider a sample of size n (i.e. n > 30) drawn from a normally distributed population having mean µ and variance $\sigma^2$. Also let $\bar{x}$ be the mean of the sample of size n; the sampling distribution of the sample mean follows the normal distribution with mean µ and variance $\sigma^2/n$.
The following steps are involved in the test of significance of a single sample mean.
Null hypothesis: $H_0: \mu = \mu_0$ i.e. there is no significant difference between the sample mean and population mean, or the samples are selected from a normally distributed population having mean $\mu_0$.
Alternative hypothesis: $H_1: \mu \ne \mu_0$ (for two tailed test) i.e. there is a significant difference between the sample mean and population mean, or the samples are selected from a normally distributed population having mean not equal to $\mu_0$.
Or,
Alternative hypothesis: $H_1: \mu > \mu_0$ (for right tailed test) i.e. the sample mean is significantly greater than the population mean, or the samples are selected from a normally distributed population having mean significantly greater than $\mu_0$.
Or,
Alternative hypothesis: $H_1: \mu < \mu_0$ (for left tailed test) i.e. the sample mean is significantly smaller than the population mean, or the samples are selected from a normally distributed population having mean significantly smaller than $\mu_0$.
Test statistic
Under the null hypothesis, the test statistic is
$Z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ if population standard deviation is known and samples are drawn with replacement
$= \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$ if population standard deviation is unknown and samples are drawn with replacement
$= \dfrac{\bar{x} - \mu_0}{\dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}}$ if population standard deviation and population size are known and samples are drawn without replacement
$= \dfrac{\bar{x} - \mu_0}{\dfrac{s}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}}$ if population standard deviation is unknown, population size is known and samples are drawn without replacement
Where,
$\bar{x}$ = sample mean
$\mu_0$ = population mean
σ = population standard deviation
s = sample standard deviation
N = population size
n = sample size
Level of significance
Let α be the level of significance. Generally we use α = 5%.
Critical value
The critical or tabulated value is obtained from the Z table based on the level of significance and the alternative hypothesis,
i.e. $Z_{tabulated} = Z_{\alpha}$ for a one tailed test, $Z_{\alpha/2}$ for a two tailed test
Decision
• If $|Z_{calculated}| > Z_{tabulated}$ then we reject the null hypothesis.
• If $|Z_{calculated}| \le Z_{tabulated}$ then we accept the null hypothesis.
Example 1
The researchers are interested in the mean age of a certain population. A random sample of 10 individuals drawn from the population of interest has a mean of 27 years. Assuming that the population is approximately normally distributed with variance 20, can you conclude that the mean is different from 30 years? (Use α = 0.05).
Solution:
Given,
Sample size (n) = 10
Sample mean age ($\bar{x}$) = 27 years
Population variance of age ($\sigma^2$) = 20 and σ = $\sqrt{20}$ = 4.47
Population mean of age ($\mu_0$) = 30 years
Level of significance (α) = 0.05
Null hypothesis: $H_0: \mu$ = 30 years i.e. there is no significant difference between the sample mean age and population mean age, or the mean of the population is not significantly different from 30 years.
Alternative hypothesis: $H_1: \mu \ne$ 30 years (two tailed test) i.e. there is a significant difference between the sample mean age and population mean age, or the mean age of the population is significantly different from 30.
Test statistic
$Z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{27 - 30}{\sqrt{20}/\sqrt{10}} = \dfrac{-3}{1.414} = -2.12$
$|Z_{calculated}|$ = 2.12
Let α = 0.05 be the level of significance.
Critical value
The critical or tabulated value at α = 0.05 level of significance for a two tailed test is
$Z_{tabulated} = Z_{\alpha/2} = Z_{0.025} = 1.96$
Decision
Since $|Z_{calculated}| > Z_{tabulated}$ (i.e. 2.12 > 1.96), we reject the null hypothesis.
Hence we can conclude that the mean age of the population is significantly different from 30 at α = 0.05 level of significance.
Example 2
Among 150 men in Bhaktapur, the mean systolic blood pressure was 146 mm Hg with a standard deviation of 27. On the basis of these data, can you conclude that the mean systolic blood pressure for the population of Bhaktapur is greater than 140 mm of Hg? Use α = 0.01
Solution:
Given,
Sample size (n) = 150
Sample mean systolic blood pressure ($\bar{x}$) = 146 mm of Hg
Sample standard deviation of systolic blood pressure (s) = 27
Population mean of systolic blood pressure ($\mu_0$) = 140 mm of Hg
Level of significance (α) = 0.01
Null hypothesis: $H_0: \mu \le$ 140 mm of Hg i.e. the population mean of systolic blood pressure is less than or equal to 140 mm of Hg.
Alternative hypothesis: $H_1: \mu >$ 140 mm of Hg (right tailed test) i.e. the population mean of systolic blood pressure is significantly greater than 140 mm of Hg.
Test statistic
$Z = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{146 - 140}{27/\sqrt{150}}$ [$\hat{\sigma}$ = s]
= 2.722
Let α = 0.01 be the level of significance.
Critical value
The critical or tabulated value at α = 0.01 level of significance for a right tailed test is
$Z_{tabulated} = Z_{\alpha} = Z_{0.01} = 2.33$
Decision
Since $Z_{calculated} > Z_{tabulated}$ (i.e. 2.722 > 2.33), we reject the null hypothesis.
Hence we can conclude that the population mean of systolic blood pressure is significantly greater than 140 mm of Hg at α = 0.01 level of significance.
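The steps used in Examples 1 and 2 can be collected into one function. This is a minimal sketch with a hypothetical function name; it covers only the with-replacement formulas and computes the critical value with `NormalDist` instead of reading it from the Z table.

```python
import math
from statistics import NormalDist


def one_sample_z_test(x_bar, mu0, sd, n, alpha, tail="two"):
    """Z-test for a single mean.

    sd is the population standard deviation, or the sample standard
    deviation when the population value is unknown (large n).
    Returns the calculated Z and True if H0 is rejected.
    """
    z = (x_bar - mu0) / (sd / math.sqrt(n))
    inv = NormalDist().inv_cdf
    if tail == "two":
        reject = abs(z) > inv(1 - alpha / 2)
    elif tail == "right":
        reject = z > inv(1 - alpha)
    else:  # left tailed
        reject = z < inv(alpha)
    return z, reject


# Example 1: mean age, variance 20, two tailed at alpha = 0.05
z1, reject1 = one_sample_z_test(27, 30, math.sqrt(20), 10, 0.05, "two")
# Example 2: systolic blood pressure, right tailed at alpha = 0.01
z2, reject2 = one_sample_z_test(146, 140, 27, 150, 0.01, "right")
print(round(z1, 2), reject1)
print(round(z2, 3), reject2)
```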
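2. Test the significance difference between two sample means
Let us consider two independent samples of sizes $n_1$ and $n_2$ drawn from two independent normally distributed populations having means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$ respectively. Also let $\bar{x}_1$ and $\bar{x}_2$ be the means of the two independent samples of sizes $n_1$ and $n_2$.
The following steps are involved in the test of significance of two independent sample means.
Null hypothesis: $H_0: \mu_1 = \mu_2$ i.e. there is no significant difference between the means of the two populations, or the two independent samples are drawn from normally distributed populations having the same mean.
Alternative hypothesis: $H_1: \mu_1 \ne \mu_2$ (for two tailed test) i.e. there is a significant difference between the means of the two populations, or the two independent samples are drawn from normally distributed populations having different means.
Or,
$H_1: \mu_1 > \mu_2$ (for right tailed test) i.e. the mean of the first population is significantly greater than the mean of the second population.
Or,
$H_1: \mu_1 < \mu_2$ (for left tailed test) i.e. the mean of the first population is significantly smaller than the mean of the second population.
Test statistic
Under the null hypothesis, the test statistic is
$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$ if population variances $\sigma_1^2$ and $\sigma_2^2$ are known.
$= \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$ if population variances $\sigma_1^2$ and $\sigma_2^2$ are unknown.
$= \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$ if population variances $\sigma_1^2 = \sigma_2^2 = \sigma^2$.
Where,
$\bar{x}_1$ = mean of sample of size $n_1$
$\bar{x}_2$ = mean of sample of size $n_2$
$\sigma_1^2$ = variance of first population
$\sigma_2^2$ = variance of second population
$s_1^2$, $s_2^2$ = variances of the first and second samples
$\sigma^2$ = common population variance
Critical value
The critical or tabulated value is obtained from the Z table based on the level of significance and the alternative hypothesis,
i.e. $Z_{tabulated} = Z_{\alpha}$ for a one tailed test, $Z_{\alpha/2}$ for a two tailed test
Decision
• If $|Z_{calculated}| > Z_{tabulated}$ then we reject the null hypothesis.
• If $|Z_{calculated}| \le Z_{tabulated}$ then we accept the null hypothesis.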
Example 3
The mean and standard deviation of BMI of 57 males were found to be 23.1 kg/m² and 3.48 kg/m², and the mean and standard deviation of BMI of 49 females were found to be 20.74 kg/m² and 2.63 kg/m² respectively. Is the mean BMI of males and females significantly different?
Solution:
Given,
For males,
Number of males ($n_1$) = 57, mean of BMI ($\bar{x}_1$) = 23.1 kg/m² and standard deviation of BMI ($s_1$) = 3.48 kg/m².
For females,
Number of females ($n_2$) = 49, mean of BMI ($\bar{x}_2$) = 20.74 kg/m² and standard deviation of BMI ($s_2$) = 2.63 kg/m².
Null hypothesis: $H_0: \mu_1 = \mu_2$ i.e. there is no significant difference between the mean BMI of males and females.
Alternative hypothesis: $H_1: \mu_1 \neq \mu_2$ i.e. there is a significant difference between the mean BMI of males and females.
Test statistic
$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{23.1 - 20.74}{\sqrt{\dfrac{3.48^2}{57} + \dfrac{2.63^2}{49}}} = \dfrac{2.36}{0.5947} = 3.969$
Let $\alpha$ = 5% be the level of significance.
Critical value
The critical value obtained from the Z table for the 5% level of significance and a two tailed test is
$Z_{tab} = Z_{\alpha/2} = Z_{0.025} = 1.96$
Decision
Since $|Z_{cal}| > Z_{tab}$ (i.e. 3.969 > 1.96), we reject the null hypothesis ($H_0$) at the 5% level of significance.
Hence we conclude that there is a significant difference between the mean BMI of males and females.
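The arithmetic in Example 3 can be checked with a short Python sketch. The helper name `two_sample_z` is hypothetical and introduced only for illustration; the sketch assumes large samples, so the sample standard deviations stand in for the unknown population values.

```python
import math
from statistics import NormalDist


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic for the difference between two independent sample means."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error


# Example 3: BMI of 57 males versus 49 females
z = two_sample_z(23.1, 3.48, 57, 20.74, 2.63, 49)

# Two tailed critical value at the 5% level of significance
alpha = 0.05
z_tab = NormalDist().inv_cdf(1 - alpha / 2)

decision = "reject H0" if abs(z) > z_tab else "accept H0"
print(round(z, 3), round(z_tab, 2), decision)
```

Reading the critical value from `NormalDist().inv_cdf` replaces the Z table lookup; the decision rule is the same as in the notes.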

Example 4
The average height of a sample of 100 people from Kathmandu is 172.34 cm with a standard deviation of 6.5 cm, while the average height of a sample of 64 people from Dharan is 174.12 cm with a standard deviation of 6.4 cm. Do these figures indicate that the people from Dharan are on average taller than the people from Kathmandu?
Solution:
Given,
For Kathmandu,
Number of people ($n_1$) = 100, average height ($\bar{x}_1$) = 172.34 cm and standard deviation of height ($s_1$) = 6.5 cm.
For Dharan,
Number of people ($n_2$) = 64, average height ($\bar{x}_2$) = 174.12 cm and standard deviation of height ($s_2$) = 6.4 cm.
Null hypothesis: $H_0: \mu_1 = \mu_2$ i.e. there is no significant difference between the average height of people in Kathmandu and Dharan.
Alternative hypothesis: $H_1: \mu_1 < \mu_2$ i.e. the average height of people from Dharan is significantly greater than the average height of people from Kathmandu.
Test statistic
$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{172.34 - 174.12}{\sqrt{\dfrac{6.5^2}{100} + \dfrac{6.4^2}{64}}} = \dfrac{-1.78}{1.0308} = -1.727$, so $|Z_{cal}| = 1.727$
Let $\alpha$ = 5% be the level of significance.
Critical value
The critical value obtained from the Z table for the 5% level of significance and a left tailed test is
$Z_{tab} = -Z_{\alpha} = -Z_{0.05} = -1.645$
$|Z_{tab}| = 1.645$
Decision
Since $|Z_{cal}| > |Z_{tab}|$ (i.e. 1.727 > 1.645), we reject the null hypothesis ($H_0$) at the 5% level of significance.
Hence, we conclude that the average height of people from Dharan is significantly greater than the average height of people from Kathmandu.
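For a one tailed test such as Example 4, the comparison can also be made directly on the signed statistic, or through a p-value. The following is a minimal sketch under the same large-sample assumption; the variable names are illustrative only.

```python
import math
from statistics import NormalDist

# Example 4: sample 1 = Kathmandu, sample 2 = Dharan
mean_k, sd_k, n_k = 172.34, 6.5, 100
mean_d, sd_d, n_d = 174.12, 6.4, 64

standard_error = math.sqrt(sd_k**2 / n_k + sd_d**2 / n_d)
z = (mean_k - mean_d) / standard_error

alpha = 0.05
z_tab = NormalDist().inv_cdf(alpha)  # left tailed critical value (negative)
p_value = NormalDist().cdf(z)        # P(Z <= z) under the null hypothesis

# Left tailed rule: reject H0 when z falls below the critical value,
# which is the same as p_value < alpha
reject = z < z_tab
print(round(z, 3), round(z_tab, 3), round(p_value, 4), reject)
```

Both criteria always agree, because the p-value is below alpha exactly when the statistic lies in the left tail beyond the critical value.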
3. Test the significance of single sample proportion
Let P be the population proportion of units possessing a certain characteristic in the population. Let a random sample of size n be drawn from the population and let x be the number of units possessing the characteristic in the sample; then the sample proportion is $p = \dfrac{x}{n}$. For large sample size, the binomial distribution can be approximated by the normal distribution.
The following steps are used to test the significance of a single sample proportion.
Null hypothesis: $H_0: P = P_0$ i.e. there is no significant difference between the sample proportion and the population proportion, or the sample is selected from a population having proportion $P_0$.
Alternative hypothesis: $H_1: P \neq P_0$ (for two tailed test) i.e. there is a significant difference between the sample proportion and the population proportion, or the sample is selected from a population whose proportion is not equal to $P_0$.
Or,
Alternative hypothesis: $H_1: P > P_0$ (for right tailed test) i.e. the sample proportion is significantly greater than the population proportion, or the sample is selected from a population having proportion significantly greater than $P_0$.
Or,
Alternative hypothesis: $H_1: P < P_0$ (for left tailed test) i.e. the sample proportion is significantly smaller than the population proportion, or the sample is selected from a population having proportion significantly smaller than $P_0$.
Test statistic
$Z = \dfrac{p - P}{\sqrt{\dfrac{PQ}{n}}}$ if the population proportion is known and samples are drawn with replacement
$= \dfrac{p - P}{\sqrt{\dfrac{PQ}{n} \cdot \dfrac{N - n}{N - 1}}}$ if the population proportion and size are known and samples are drawn without replacement
Where,
p = sample proportion
P = population proportion (the value $P_0$ specified under the null hypothesis)
Q = 1 – P.
N = population size
n = sample size
Critical value
The critical or tabulated value is obtained from the Z table based on the level of significance and the alternative hypothesis.
i.e. $Z_{tab} = Z_{\alpha/2}$ for a two tailed test and $Z_{\alpha}$ for a one tailed test.
Decision
If $|Z_{cal}| > Z_{tab}$ then we reject the null hypothesis.
If $|Z_{cal}| \le Z_{tab}$ then we accept the null hypothesis.
Example 5
According to the Centers for Disease Control and Prevention, 60% of all American adults aged 18 to 24 currently drink alcohol. A sample of 450 college students from California indicates that 66% currently drink alcohol. Is the proportion of college students from California who currently drink alcohol different from the nationwide proportion?
Solution: Given,
Population proportion of adults aged 18 to 24 years who drink alcohol (P) = 60% = 0.6
And Q = 1 – P = 1 – 0.6 = 0.4
Sample proportion of college students who drink alcohol (p) = 66% = 0.66
Sample size (n) = 450
Null hypothesis: $H_0: P = 60\%$ i.e. the proportion of college students from California who currently drink alcohol is not significantly different from the nationwide proportion.
Alternative hypothesis: $H_1: P \neq 60\%$ i.e. the proportion of college students from California who currently drink alcohol is significantly different from the nationwide proportion.
Test statistic
$Z = \dfrac{p - P}{\sqrt{\dfrac{PQ}{n}}} = \dfrac{0.66 - 0.60}{\sqrt{\dfrac{0.6 \times 0.4}{450}}} = \dfrac{0.06}{0.0231} = 2.598$
Let α = 0.05 be the level of significance.
Critical value
The critical value obtained from the Z table for the 5% level of significance and a two tailed test is
$Z_{tab} = Z_{\alpha/2} = Z_{0.025} = 1.96$
Decision
Since $|Z_{cal}| > Z_{tab}$ (i.e. 2.598 > 1.96), we reject the null hypothesis ($H_0$) at the 5% level of significance.
Hence we conclude that the proportion of college students from California who currently drink alcohol is significantly different from the nationwide proportion.
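The single-proportion calculation in Example 5 can be sketched in Python as follows. The function `one_proportion_z` and its optional `population_size` argument are hypothetical names chosen for this illustration; the optional argument applies the finite population correction for sampling without replacement.

```python
import math
from statistics import NormalDist


def one_proportion_z(p_sample, p_pop, n, population_size=None):
    """Z statistic for a single sample proportion against a hypothesised P."""
    q_pop = 1 - p_pop
    variance = p_pop * q_pop / n
    if population_size is not None:
        # Finite population correction (sampling without replacement)
        variance *= (population_size - n) / (population_size - 1)
    return (p_sample - p_pop) / math.sqrt(variance)


# Example 5: p = 0.66 from n = 450 students, hypothesised P = 0.60
z = one_proportion_z(0.66, 0.60, 450)
z_tab = NormalDist().inv_cdf(0.975)  # two tailed, alpha = 0.05

decision = "reject H0" if abs(z) > z_tab else "accept H0"
print(round(z, 3), decision)
```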
Example 6 (homework)
In a random survey of 1000 households in the United States, it is found that 29 percent of the households have at least one member with a college degree. Does this finding refute the statement that the proportion of all such United States households is at least 35 percent? Test at the α = 0.05 significance level.
4. Test the significance difference between two sample proportions
Let $P_1$ and $P_2$ be the two population proportions possessing a certain characteristic. Let two independent samples of sizes $n_1$ and $n_2$ be drawn from the two populations. Also let $p_1$ and $p_2$ be the proportions of units possessing the characteristic in the two independent samples.
The following steps are involved in testing the significance of the difference between two sample proportions.
Null hypothesis: $H_0: P_1 = P_2$ i.e. there is no significant difference between the proportions of the two populations, or the two independent samples are drawn from populations having the same proportion.
Alternative hypothesis: $H_1: P_1 \neq P_2$ (for two tailed test) i.e. there is a significant difference between the proportions of the two populations, or the two independent samples are drawn from populations having different proportions.
Or,
$H_1: P_1 > P_2$ (for right tailed test) i.e. the proportion of the first population is significantly greater than the proportion of the second population.
Or,
$H_1: P_1 < P_2$ (for left tailed test) i.e. the proportion of the first population is significantly smaller than the proportion of the second population.

Test statistic
$Z = \dfrac{p_1 - p_2}{\sqrt{\dfrac{P_1 Q_1}{n_1} + \dfrac{P_2 Q_2}{n_2}}}$ if population proportions $P_1$ and $P_2$ are known.
$= \dfrac{p_1 - p_2}{\sqrt{\hat{P}\hat{Q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$ if population proportions $P_1$ and $P_2$ are unknown.
Where,
$\hat{P} = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \dfrac{x_1 + x_2}{n_1 + n_2}$ = combined unbiased estimate of population proportion, with $x_1$ and $x_2$ the numbers of units possessing the characteristic in the two samples
$\hat{Q} = 1 - \hat{P}$
$p_1$ = proportion of sample of size $n_1$
$p_2$ = proportion of sample of size $n_2$
$P_1$ = proportion of first population, $Q_1 = 1 - P_1$
$P_2$ = proportion of second population, $Q_2 = 1 - P_2$
Level of significance
Let α be the level of significance. Generally we use 5%.
Critical value
The critical or tabulated value is obtained from the Z table based on the level of significance and the alternative hypothesis.
i.e. $Z_{tab} = Z_{\alpha/2}$ for a two tailed test and $Z_{\alpha}$ for a one tailed test.
Decision
If $|Z_{cal}| > Z_{tab}$ then we reject the null hypothesis.
If $|Z_{cal}| \le Z_{tab}$ then we accept the null hypothesis.
Example 6:
In a sample of 600 drivers from a certain city, 50 drivers are found to be HIV positive. In a sample of 900 from another city, 450 are found to be HIV positive. Do the data indicate that the two cities are significantly different with respect to prevalence of HIV among the population?
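The notes leave the HIV prevalence example unsolved. One way to work it, sketched here in Python, is the pooled form of the statistic for unknown population proportions. The function name `two_proportion_z` is hypothetical, and the resulting figure is this sketch's own calculation, not a value taken from the notes.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Pooled Z statistic for the difference between two sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)  # combined estimate of the proportion
    q_pooled = 1 - p_pooled
    standard_error = math.sqrt(p_pooled * q_pooled * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# 50 of 600 drivers positive in the first city, 450 of 900 in the second
z = two_proportion_z(50, 600, 450, 900)
z_tab = NormalDist().inv_cdf(0.975)  # two tailed, alpha = 0.05

decision = "reject H0" if abs(z) > z_tab else "accept H0"
print(round(z, 2), decision)
```

Because the question asks only whether the two cities differ, a two tailed test is assumed here.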
t-test (small sample test)
When the size of the sample selected from the population is less than or equal to 30, it is known as a small sample. In such a case the sampling distribution of the sample statistic is not approximately normal, and as a result the Z test or normal test is not applied.
For a small sample the estimated statistic varies from sample to sample and may also lie far from the population parameter. Hence a modification of hypothesis testing is made, called the exact sample test or small sample test.
When the sample size is less than or equal to 30 (i.e. n ≤ 30), the sampling distribution of the sample mean follows Student’s t distribution. The t distribution has a shape similar to the normal distribution but is a little flatter. Student’s t statistic is defined as
$t = \dfrac{\bar{x} - \mu}{S/\sqrt{n}}$
$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n - 1}}$
Where,
$\bar{x} = \dfrac{\sum x}{n}$ = sample mean
$\mu$ = population mean
$S = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ = the unbiased estimate of population standard deviation.
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$ = sample standard deviation or biased estimator of population standard deviation.
n – 1 = degree of freedom
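The two forms of the t statistic give the same value. A short derivation, using only the definitions of S and s stated above:

```latex
\begin{aligned}
S^2 &= \frac{\sum (x - \bar{x})^2}{n - 1}
     = \frac{n}{n - 1}\cdot\frac{\sum (x - \bar{x})^2}{n}
     = \frac{n}{n - 1}\, s^2 \\[4pt]
\frac{S}{\sqrt{n}} &= \sqrt{\frac{n\, s^2}{(n - 1)\, n}} = \frac{s}{\sqrt{n - 1}} \\[4pt]
t &= \frac{\bar{x} - \mu}{S/\sqrt{n}} = \frac{\bar{x} - \mu}{s/\sqrt{n - 1}}
\end{aligned}
```

So either estimator may be used, provided it is paired with the matching denominator.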
Student’s t-test is used when

• The sample size is less than or equal to 30 (i.e. n ≤ 30).
• The parent population from which the sample is drawn is normal.
• The population standard deviation is unknown.
• The sample observations are independent.
• The samples are drawn by a random sampling technique.
The t-test is used to
• Test the significance of single sample mean
• Test the significance difference between two independent sample means.
• Test the significance difference between two dependent sample means (paired t-test)
1. Test the significance of single sample mean.

Let us consider a random sample of size n (i.e. n ≤ 30) drawn from a normal population having mean $\mu$ and unknown variance $\sigma^2$. Also let $x_1, x_2, \ldots, x_n$ be the n independent sample observations.
The test is based on the assumptions that the sample is selected from a normal population with unknown variance and that the sample observations are independent.
The following steps are involved in testing the significance of a single sample mean.

Null hypothesis: $H_0: \mu = \mu_0$ i.e. there is no significant difference between the sample mean and the population mean, or the sample is selected from a normally distributed population having mean $\mu_0$.
Alternative hypothesis: $H_1: \mu \neq \mu_0$ (for two tailed test) i.e. there is a significant difference between the sample mean and the population mean, or the sample is selected from a normally distributed population whose mean is not equal to $\mu_0$.
Or,
Alternative hypothesis: $H_1: \mu > \mu_0$ (for right tailed test) i.e. the sample mean is significantly greater than the population mean, or the sample is selected from a normally distributed population having mean significantly greater than $\mu_0$.
Or,
Alternative hypothesis: $H_1: \mu < \mu_0$ (for left tailed test) i.e. the sample mean is significantly smaller than the population mean, or the sample is selected from a normally distributed population having mean significantly smaller than $\mu_0$.

Test statistic
$t = \dfrac{\bar{x} - \mu}{S/\sqrt{n}} = \dfrac{\bar{x} - \mu}{s/\sqrt{n - 1}}$
Where,
$\bar{x} = \dfrac{\sum x}{n}$ = sample mean
$\mu$ = population mean (the value $\mu_0$ specified under the null hypothesis)
$S = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ = the unbiased estimate of population standard deviation.
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$ = sample standard deviation or biased estimator of population standard deviation.
n – 1 = degree of freedom
Level of significance
Let α be the level of significance. Generally we use 5%.
Degree of freedom
The degree of freedom is n – 1.
Critical value
The critical or tabulated value is obtained from the t table based on the level of significance (α), the degree of freedom (n – 1) and the alternative hypothesis ($H_1$).
i.e. $t_{tab} = t_{\alpha/2,\, n-1}$ for a two tailed test and $t_{\alpha,\, n-1}$ for a one tailed test.
Decision
If $|t_{cal}| > t_{tab}$ then we reject the null hypothesis ($H_0$) at the α level of significance.
If $|t_{cal}| \le t_{tab}$ then we accept the null hypothesis ($H_0$) at the α level of significance.
Example:
A random sample of 10 bags is drawn and their contents are found to weigh (in kg) as follows: 50, 49, 52, 44, 45, 48, 46, 45, 49, 45.
Test the significance of the sample mean if the average packing can be taken to be 50 kg.

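Solution:
Weight (X)    X²
50            2500
49            2401
52            2704
44            1936
45            2025
48            2304
46            2116
45            2025
49            2401
45            2025
∑X = 473      ∑X² = 22437
Mean weight $\bar{x} = \dfrac{\sum X}{n} = \dfrac{473}{10}$ = 47.3 kg
And the unbiased estimate of population standard deviation is
$S = \sqrt{\dfrac{1}{n - 1}\left(\sum X^2 - n\bar{x}^2\right)} = \sqrt{\dfrac{1}{9}\left(22437 - 10 \times 47.3^2\right)}$ = 2.67 kg
Population mean ($\mu$) = 50 kg
Sample size n = 10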
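The notes stop after computing the mean and S. Following the steps listed earlier, the remaining calculation could be sketched in Python as below. The critical value 2.262 is the two tailed 5% entry for 9 degrees of freedom from a standard t table and is an assumption of this sketch, since the notes do not quote it.

```python
import math
from statistics import mean, stdev

weights = [50, 49, 52, 44, 45, 48, 46, 45, 49, 45]  # bag contents in kg
mu0 = 50                                            # hypothesised mean

n = len(weights)
x_bar = mean(weights)
S = stdev(weights)  # divides by n - 1, i.e. the unbiased estimate S

t = (x_bar - mu0) / (S / math.sqrt(n))
df = n - 1

# Assumed table value: two tailed test, alpha = 0.05, 9 degrees of freedom
t_tab = 2.262
decision = "reject H0" if abs(t) > t_tab else "accept H0"
print(round(x_bar, 1), round(S, 2), round(t, 3), df, decision)
```

`statistics.stdev` matches the definition of S in the notes; `statistics.pstdev` would give the biased estimator s instead.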

Chi-square test ($\chi^2$)
• The chi-square test is one of the simplest and most widely used non-parametric tests in statistical inference.
• It is based on the chi-square distribution and does not depend on any summary values of the population. The chi-square test is most commonly used when the data are frequencies, such as the number of responses in two or more categories.
• The chi-square test was first used by Karl Pearson, and the quantity chi-square describes the magnitude of the discrepancy between theory and observation.
• It is also used to test the significance of the difference between observed frequencies and expected frequencies.
• It is defined as
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
Where,
O = Observed frequency
E = Expected frequency
Chi-square test is used
• to test the goodness of fit
• to test the independence of attributes
Test the independence of attributes or association between two variables
Characteristics which can be measured qualitatively but not quantitatively are called attributes. The chi-square test is used to find the association between two attributes.
Let us consider a sample of size N taken from a population of unknown distribution. The sample observations are classified by two attributes A and B into r and c classes respectively. Also let $O_{ij}$ be the observed frequency of the $(A_i, B_j)$ class.
The arrangement of observed frequencies classified in an r × c contingency table is
A \ B           B1     B2     .....   Bc     Total ($R_i$)
A1              O11    O12    .....   O1c    R1
A2              O21    O22    .....   O2c    R2
.               .      .              .      .
.               .      .              .      .
Ar              Or1    Or2    .....   Orc    Rr
Total ($C_j$)   C1     C2     .....   Cc     N
Null hypothesis ($H_0$): there is no significant association between the two attributes A and B.
Alternative hypothesis ($H_1$): there is a significant association between the two attributes A and B.
Test statistic
Under the null hypothesis, the test statistic is
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
Where O = observed frequency
E = expected frequency, and the expected frequency for any cell is determined by
$E_{ij} = \dfrac{R_i \times C_j}{N} = \dfrac{\text{row total} \times \text{column total}}{\text{grand total}}$
Let α be the level of significance

Degree of freedom
The d.f. is (r - 1)(c - 1), where r = number of rows and c = number of columns.
Critical value
The critical or tabulated value obtained from the chi-square table based on the α level of significance and (r – 1)(c - 1) degrees of freedom is
$\chi^2_{tab} = \chi^2_{\alpha,\,(r-1)(c-1)}$
Decision
• If $\chi^2_{cal} > \chi^2_{tab}$ then we reject the null hypothesis ($H_0$).
• If $\chi^2_{cal} \le \chi^2_{tab}$ then we accept the null hypothesis ($H_0$).
When the two attributes A and B are each classified into 2 subgroups
The 2 × 2 contingency table of two attributes A and B is
A \ B     B1     B2     Total
A1        a      b      a+b
A2        c      d      c+d
Total     a+c    b+d    N=a+b+c+d
Null hypothesis ($H_0$): two attributes A and B are independent.

Alternative hypothesis ($H_1$): two attributes A and B are dependent.
Test statistic
Under the null hypothesis ($H_0$), the test statistic is
$\chi^2 = \dfrac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$
If the frequency of any cell is less than 5, then apply Yates’s correction:
$\chi^2 = \dfrac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)}$
Let α be the level of significance
Degree of freedom
The d.f. is 1
Critical value
The critical or tabulated value is obtained from the chi-square table based on the level of significance and the degree of freedom.
i.e. $\chi^2_{tab} = \chi^2_{\alpha,\,1}$
Decision
• If $\chi^2_{cal} > \chi^2_{tab}$ then we reject the null hypothesis at the α level of significance.
• If $\chi^2_{cal} \le \chi^2_{tab}$ then we accept the null hypothesis at the α level of significance.
Example:
In an experiment to study the dependence of hypertension on smoking habit, the following data were taken on 186 individuals.
                   Non-smoker   Moderate smoker   Heavy smoker
Hypertension       21           36                36
No hypertension    48           26                19
Test the hypothesis that the presence or absence of hypertension is independent of smoking habit.

Solution:
Observed frequencies with row and column totals:
                   Non-smoker   Moderate smoker   Heavy smoker   Total
Hypertension       21           36                36             93
No hypertension    48           26                19             93
Total              69           62                55             N = 186
Calculation of $\chi^2$:
Cell     Number of patients (O)   E = (row total × column total)/N   (O − E)²   (O − E)²/E
(1,1)    21                       (93 × 69)/186 = 34.5               182.25     5.282609
(1,2)    36                       (93 × 62)/186 = 31                 25         0.806452
(1,3)    36                       (93 × 55)/186 = 27.5               72.25      2.627273
(2,1)    48                       (93 × 69)/186 = 34.5               182.25     5.282609
(2,2)    26                       (93 × 62)/186 = 31                 25         0.806452
(2,3)    19                       (93 × 55)/186 = 27.5               72.25      2.627273
Total    ∑O = 186                 ∑E = 186                                      ∑(O − E)²/E = 17.43267

Null hypothesis ($H_0$): i.e. hypertension is independent of smoking habit, or there is no significant association between hypertension and smoking habit.
Alternative hypothesis ($H_1$): i.e. hypertension is dependent on smoking habit, or there is a significant association between hypertension and smoking habit.
Test statistic
Under the null hypothesis, the test statistic is
$\chi^2 = \sum \dfrac{(O - E)^2}{E} = 17.43$
Let α = 5% be the level of significance
Degree of freedom
The d.f. is (r - 1)(c - 1) = (2-1)(3-1) = 2.
Critical value
The critical or tabulated value obtained from the chi-square table based on the 5% level of significance and 2 degrees of freedom is
$\chi^2_{tab} = \chi^2_{0.05,\,2} = 5.991$
Decision
Since $\chi^2_{cal} > \chi^2_{tab}$ i.e. 17.43 > 5.991, we reject the null hypothesis ($H_0$) at the 5% level of significance.
Hence hypertension is not independent of smoking habit at the 5% level of significance.
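The expected frequencies and the chi-square total in the hypertension example can be reproduced with a few lines of plain Python. The function name `chi_square_independence` is a hypothetical helper for this sketch; it works for any r × c table of observed counts.

```python
def chi_square_independence(observed):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected frequency = row total x column total / grand total
            e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - e) ** 2 / e

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Rows: hypertension, no hypertension; columns: non, moderate, heavy smoker
observed = [[21, 36, 36],
            [48, 26, 19]]
chi2, df = chi_square_independence(observed)

chi2_tab = 5.991  # table value for alpha = 0.05 and 2 d.f., as quoted in the notes
decision = "reject H0" if chi2 > chi2_tab else "accept H0"
print(round(chi2, 5), df, decision)
```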
Example:
A tobacco company claims that there is no relationship between smoking habit and lung ailment. To investigate the claim, a random sample of 300 males in the age group of 40 to 50 is given a medical test. The observed sample results are tabulated below:
              Lung ailment   No lung ailment   Total
Smoker        75             105               180
Non-smoker    25             95                120
Total         100            200               300
Null hypothesis ($H_0$): i.e. there is no significant relationship between smoking habit and lung ailment.
Alternative hypothesis ($H_1$): i.e. there is a significant relationship between smoking habit and lung ailment.
Test statistic
Under the null hypothesis, the test statistic is
$\chi^2 = \dfrac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$
$= \dfrac{300(75 \times 95 - 105 \times 25)^2}{180 \times 120 \times 100 \times 200} = 14.06$
Level of significance
Let α = 5% be the level of significance
Degree of freedom
The d.f. is 1
Critical value
The critical or tabulated value obtained from the chi-square table based on the 5% level of significance and 1 degree of freedom is
$\chi^2_{tab} = \chi^2_{0.05,\,1} = 3.841$
Decision
Since $\chi^2_{cal} > \chi^2_{tab}$ i.e. 14.06 > 3.841, we reject the null hypothesis ($H_0$) at the 5% level of significance.
Hence there is a significant relationship between smoking habit and lung ailment at the 5% level of significance.
Example:
Examine whether the vaccine has an effect in controlling the disease. In an experiment on immunization of cattle from tuberculosis, the following results were obtained:
                  Affected   Not affected
Inoculated        2          10
Not inoculated    6          6
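Null hypothesis ($H_0$): i.e. the vaccine has no significant effect in controlling the disease.
Alternative hypothesis ($H_1$): i.e. the vaccine has a significant effect in controlling the disease.
Test statistic
Since one cell frequency (2) is less than 5, Yates’s correction is applied. Under the null hypothesis, the test statistic is
$\chi^2 = \dfrac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)} = \dfrac{24\left(|2 \times 6 - 10 \times 6| - 12\right)^2}{12 \times 12 \times 8 \times 16} = 1.6875$
Level of significance
Let α = 5% be the level of significance
Degree of freedom
The d.f. is 1
Critical value
The critical or tabulated value obtained from the chi-square table based on the 5% level of significance and 1 degree of freedom is
$\chi^2_{tab} = \chi^2_{0.05,\,1} = 3.841$
Decision
Since $\chi^2_{cal} < \chi^2_{tab}$ i.e. 1.6875 < 3.841, we accept the null hypothesis ($H_0$) at the 5% level of significance.
Hence the vaccine has no significant effect in controlling the disease at the 5% level of significance.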
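The 2 × 2 shortcut formula, with and without Yates’s correction, can be sketched as a small Python function. The name `chi_square_2x2` and the `yates` flag are hypothetical, introduced only for this illustration; clipping the corrected difference at zero is a common safeguard and an assumption of the sketch.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2 x 2 table [[a, b], [c, d]] via the shortcut formula."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates's correction: shrink |ad - bc| by N/2, never below zero
        diff = max(diff - n / 2, 0)
    return n * diff**2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Smoking habit and lung ailment: all cells are 5 or more, no correction
chi2_smoking = chi_square_2x2(75, 105, 25, 95)

# Vaccine example: one cell is below 5, so the correction is applied
chi2_vaccine = chi_square_2x2(2, 10, 6, 6, yates=True)

chi2_tab = 3.841  # alpha = 0.05, 1 d.f., as quoted in the notes
print(round(chi2_smoking, 2), chi2_smoking > chi2_tab)
print(round(chi2_vaccine, 4), chi2_vaccine > chi2_tab)
```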
Correlation
• Correlation is a statistical device designed to measure the degree of association between two or more variables, e.g. in studying the relationship between height and weight of children, blood pressure and age of patients, fever and weight of children, income and expenditure etc.
• The statistical tool used to measure the degree of association between such variables is known as correlation, and its summary value is known as the correlation coefficient.
• It is generally denoted by r and is independent of the original units of measurement of the variables under study.
Types of correlation
• Positive correlation
• Negative correlation
• Linear correlation
• Non-linear correlation
• Simple correlation
• Partial correlation
• Multiple correlation
Methods of studying correlation
• Graphical method or scatter diagram method.
• Karl Pearson’s correlation coefficient
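Karl Pearson’s correlation coefficient
• Karl Pearson’s correlation coefficient measures the degree of linear relationship between two variables.
• Let X and Y be the two variables; then the correlation coefficient between X and Y is denoted by r(X, Y) or rXY or simply r.
• It is also known as the product moment correlation coefficient or simply the correlation coefficient, and is given by
$r = \dfrac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}}$
Where,
r = correlation coefficient
cov(X, Y) = covariance between the two variables X and Y, which measures the simultaneous changes between the two variables
$= \dfrac{1}{n}\sum (X - \bar{X})(Y - \bar{Y}) = \dfrac{1}{n}\sum XY - \bar{X}\,\bar{Y}$
Var(X) = variance of the X variable, which measures the variation of the X variable
$= \dfrac{1}{n}\sum (X - \bar{X})^2 = \dfrac{1}{n}\sum X^2 - \bar{X}^2$
Var(Y) = variance of the Y variable, which measures the variation of the Y variable
$= \dfrac{1}{n}\sum (Y - \bar{Y})^2 = \dfrac{1}{n}\sum Y^2 - \bar{Y}^2$
Substituting these expressions gives the computational form
$r = \dfrac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}}$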
Interpretation of Karl Pearson’s correlation coefficient
• If r = +1, then there is perfect positive correlation between the two variables
• If r = -1, then there is perfect negative correlation between the two variables
• If r = 0, then there is no correlation between the two variables
• If r is close to +1, then there is a high degree of positive correlation between the two variables
• If r is close to -1, then there is a high degree of negative correlation between the two variables
• If r is close to 0, then there is a low degree of positive or negative correlation between the two variables
Properties of Karl Pearson’s correlation coefficient
• The correlation coefficient lies between -1 and +1 i.e. -1 ≤ r ≤ +1.
• The correlation coefficient is the geometric mean of the two regression coefficients i.e. $r = \pm\sqrt{b_{yx} \times b_{xy}}$
• Where, $b_{yx}$ = regression coefficient of Y on X and $b_{xy}$ = regression coefficient of X on Y.
• The correlation coefficient is independent of change of origin as well as scale.
• The correlation coefficient is a relative statistical measure.
• The correlation coefficient is symmetrical i.e. $r_{XY} = r_{YX}$.

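Example
Observations on X and Y are given below, where X and Y represent age in years and weight in kg respectively.
Age (X)       2    5    7    11   9    14
Weight (Y)    10   18   24   30   25   35
Calculate Karl Pearson’s correlation coefficient and interpret the result.
Solution:
Age (X)    Weight (Y)   X²          Y²           XY
2          10           4           100          20
5          18           25          324          90
7          24           49          576          168
11         30           121         900          330
9          25           81          625          225
14         35           196         1225         490
∑X = 48    ∑Y = 142     ∑X² = 476   ∑Y² = 3750   ∑XY = 1323
$r = \dfrac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}} = \dfrac{6 \times 1323 - 48 \times 142}{\sqrt{6 \times 476 - 48^2}\,\sqrt{6 \times 3750 - 142^2}} = \dfrac{1122}{\sqrt{552}\,\sqrt{2336}} = 0.988$
Since r = 0.988 is close to +1, there is a high degree of positive correlation between age and weight.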
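The column totals and the value of r in this example can be verified with a short Python sketch of the computational formula. The function name `pearson_r` is hypothetical and chosen for illustration.

```python
import math


def pearson_r(x, y):
    """Karl Pearson's r from the raw sums n, sum X, sum Y, sum X^2, sum Y^2, sum XY."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))

    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x**2)
                   * math.sqrt(n * sum_y2 - sum_y**2))
    return numerator / denominator


ages = [2, 5, 7, 11, 9, 14]           # X, age in years
weights = [10, 18, 24, 30, 25, 35]    # Y, weight in kg

r = pearson_r(ages, weights)
print(round(r, 3))
```

Swapping the two arguments gives the same value, which illustrates the symmetry property of the coefficient.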

Practical question
• The probability that a student at a certain college will catch a mild cold in winter is 0.60. If six students are selected at random during a winter, find using the output below the probability that:
(i) Exactly 4 students will catch the cold?
(ii) Less than two will catch the cold.
(iii) At most two students will catch the cold?
(iv) At least 2 students will catch the cold?
(v) More than two will catch the cold?
(vi) Also calculate the mean and variance.
• Consider the occurrence of retinal capillary hemangioma (RCH) in patients with disease. The number of RCH tumor incidents follows a Poisson distribution with mean 4 tumors per eye for patients with disease. Find the probability that in a randomly selected patient with disease:
(a) There are exactly five occurrences of tumors per eye.
(b) There are more than five occurrences of tumors per eye.
(c) There are fewer than five occurrences of tumors per eye.
(d) There are between five and seven occurrences of tumors per eye, inclusive
• If the total cholesterol values for a certain population are approximately normally distributed with a mean of 200 mg/100 ml and a standard deviation of 20 mg/100 ml, find the probability that an individual picked at random from this population will have a cholesterol value:
(a) Between 180 and 200 mg/100 ml
(b) Greater than 225 mg/100 ml
(c) Less than 150 mg/100 ml
(d) Between 190 and 210 mg/100 ml
• Narcolepsy is a disease involving disturbances of the sleep–wake cycle. A researcher studied the relationship between migraine headaches in 96 subjects diagnosed with narcolepsy and 96 healthy controls. The results are shown in the table:
               Migraine headaches
               Yes    No     Total
Narcoleptic    21     75     96
Control        19     77     96
Total          40     152    192
Test the association between migraine headaches and narcolepsy status (subjects diagnosed with narcolepsy versus healthy controls).
• Suppose age in days and birthweight in oz are measured for 10 infants as shown in the following table.
• Calculate the correlation coefficient between age and blood pressure of ten patients from the following data:
Patient    Age    Blood pressure
1          16     46
2          19     40
3          23     50
4          13     35
5          14     30
6          12     60
7          17     50
8          21     45
9          11     42
10         15     45
