Chapter Five Sampling and Sampling Distribution
Introduction
The main objective of statistical analysis is to determine the actual values of different parameters of a given population. One way of learning the parameters is to conduct a census. A census is a complete enumeration of the entire population in order to determine the value of the parameter of interest. In most cases, however, a census is not feasible from a practical point of view due to cost, time, labor and other constraints. As an alternative to a census, one can use a sampling approach to determine the same thing. Sampling is the process of selecting a sample from a population: random samples of a given size are taken from the population, and the characteristics of these samples are properly analyzed in order to infer the characteristics of the population.
When random samples of a certain size are repeatedly drawn from a given population and a sample statistic is computed for each, the computed value of the statistic (e.g. the sample mean) will differ from sample to sample. Since a sample statistic is based on a sample of a certain size, it is a random variable, and each statistic follows a probability distribution of its own, called a sampling distribution.
A sampling distribution has its own properties, upon which the rules for generalizing about a population on the basis of a sample drawn from it are built. In this chapter, we will study the properties of some statistics in a bit more depth, and examine widely used sampling distributions such as the t, F, and chi-square (χ²) distributions.
Random variable
A variable is a random variable if its value is determined by a random experiment. If a variable X is said to be a random variable, it represents a phenomenon of interest in which the observed outcome of an activity is determined entirely by chance. It is unpredictable and varies depending upon the particular outcome of the experiment measured. For example, suppose you toss a die and measure X as the number observed on the upper face. The variable X can take on any of the six values 1, 2, 3, 4, 5 and 6, depending on the random outcome of the experiment. Since the value of X cannot be determined before the experiment, X is a random variable. X can also be the number of occurrences of an event, such as the number of telephone calls received at random during a given period of time.
Population (or universe): the aggregate of statistical data forming the subject of an investigation.
Statistic
A statistic is a numerical descriptive measure calculated from a sample. In other words, it is a summary measure that describes a characteristic of a sample. In most cases it refers to the sample mean and the sample variance. If X₁, X₂, . . ., Xₙ are a random sample, then

X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ

is called the sample mean, and

S² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

is called the sample variance.
For example, for the sample values 3, 6, 9 (n = 3):

X̄ = (3 + 6 + 9)/3 = 18/3 = 6

Xᵢ    Xᵢ − X̄    (Xᵢ − X̄)²
3      −3         9
6       0         0
9       3         9

Σ(Xᵢ − X̄)² = 18

S² = Σ(Xᵢ − X̄)²/(n − 1) = 18/2 = 9, so S = √9 = 3.
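These computations can be checked with a short Python sketch (standard library only; the sample values 3, 6, 9 are taken from the example above):

```python
import statistics

sample = [3, 6, 9]
n = len(sample)
mean = sum(sample) / n                     # (3 + 6 + 9) / 3 = 6.0
ss = sum((x - mean) ** 2 for x in sample)  # 9 + 0 + 9 = 18.0
s2 = ss / (n - 1)                          # 18 / 2 = 9.0
print(mean, ss, s2)                        # → 6.0 18.0 9.0

# the standard library agrees: statistics.variance uses the n - 1 divisor
assert mean == statistics.mean(sample)
assert s2 == statistics.variance(sample)
```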
Parameter
A parameter is a numerical descriptive measure that characterizes a population; in other words, a summary measure that describes some characteristic of the population. Since it must be determined from observations on the whole population, the values of parameters are unknown in the case of a large population. Parameters include the population mean and variance, among others.
The mean and the variance of the population below represent parameters of that population. For the population 3, 6, 9, 12, 15,

μ = (3 + 6 + 9 + 12 + 15)/5 = 45/5 = 9

is a parameter, namely the population mean. Here we can determine the population parameter because the population under study is finite.
Sampling distribution
A sampling distribution provides the basis for determining the level of confidence or reliability with which a particular value of a given sample statistic can be used as an estimate of a parameter. It also serves as the necessary ground for evaluating a particular hypothesis stated with reference to a parameter. Both of these processes require a clear understanding of the various sampling distributions and their properties, which define the relationships between a given sample statistic and the corresponding population parameter. Therefore, let us first describe what a sampling distribution means and examine the properties of different statistical sampling distributions.
As stated in the introduction, sampling is used as an alternative to a census to determine the characteristics of a population. That is, a random sample of a given size is taken from the population, and on it we base our estimates of the population's parameters. However, when samples are drawn repeatedly from a population, a given sample may or may not be representative. In other words, sample statistics such as the sample mean and variance are random variables, because different samples can lead to different values of the statistic.
Since each possible value of a statistic occurs with its own frequency across repeated samples, the probability of obtaining a given value of the statistic can be determined from these frequencies. A sample statistic together with its probabilities of occurrence represents a sampling distribution.
Definition: The sampling distribution of a statistic is the probability distribution of the possible values of the statistic that result when random samples of size n are repeatedly drawn from the population.
Example: Consider a population consisting of the N = 5 numbers 3, 6, 9, 12, 15. If a random sample of size n = 3 is selected without replacement, find the sampling distribution of the sample mean, X̄.
Solution: There are 10 possible random samples of size n = 3, and each sample is equally likely, with probability 1/10. These samples, along with the calculated values of X̄, are given as follows:

Sample       X̄      Sample        X̄
3, 6, 9      6      3, 12, 15    10
3, 6, 12     7      6, 9, 12      9
3, 6, 15     8      6, 9, 15     10
3, 9, 12     8      6, 12, 15    11
3, 9, 15     9      9, 12, 15    12

Collecting the values of X̄ with their probabilities gives the sampling distribution:

x̄:      6     7     8     9     10    11    12
p(x̄):   0.1   0.1   0.2   0.2   0.2   0.1   0.1

Its mean is E(X̄) = 6(0.1) + 7(0.1) + 8(0.2) + 9(0.2) + 10(0.2) + 11(0.1) + 12(0.1) = 9, which equals the population mean μ = 9.
Child    Age (X)
1          2
2          4
3          6
4          8
5         10

If random samples of size 2 are drawn from this population without replacement, construct the sampling distribution of the sample mean.
Therefore, on average, the mean of the sampling distribution of X̄ is equal to the population mean.
NB: As the sample size increases, the sample mean concentrates around the population mean. As n approaches infinity, the sample mean coincides with the population mean.
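The claim that the mean of the sampling distribution equals the population mean can be verified by brute-force enumeration. The sketch below, an illustration in Python, enumerates every sample of size 3 drawn without replacement from the population 3, 6, 9, 12, 15 and computes the exact expected value of the sample mean using fractions:

```python
from itertools import combinations
from fractions import Fraction

population = [3, 6, 9, 12, 15]
mu = Fraction(sum(population), len(population))     # population mean μ = 9

samples = list(combinations(population, 3))         # 10 equally likely samples
means = [Fraction(sum(s), 3) for s in samples]

# build the sampling distribution of the sample mean
dist = {}
for m in means:
    dist[m] = dist.get(m, 0) + Fraction(1, len(samples))

expected = sum(m * p for m, p in dist.items())
print(len(samples), float(mu), float(expected))     # → 10 9.0 9.0
assert expected == mu                               # E(X̄) equals μ exactly
```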
Example 1: It is known from past observation that the incomes of the households in a certain village are approximately normally distributed with a mean weekly income of 30 Birr and a variance of 36. Samples of 25 households are to be selected and their incomes recorded.
a) Find the probability that the sample mean will fall within three Birr of the population mean.
b) How many observations should be included in the sample if we wish the sample mean to be within 2 Birr of the population mean with probability 0.95?
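A possible worked solution, sketched in Python under the stated figures (σ = √36 = 6, n = 25). The helper phi is built from math.erf, not a library function, and the constant 1.9599... is the usual critical value Φ⁻¹(0.975):

```python
from math import erf, sqrt

def phi(z):                     # standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 6.0, 25              # population sd = sqrt(36), sample size
se = sigma / sqrt(n)            # standard error = 6/5 = 1.2

# (a) P(|X̄ - μ| < 3) = P(-3/se < Z < 3/se) = P(-2.5 < Z < 2.5)
p_a = phi(3 / se) - phi(-3 / se)
print(round(p_a, 4))            # ≈ 0.9876

# (b) smallest n with P(|X̄ - μ| <= 2) = 0.95  →  1.96·σ/√n = 2
z95 = 1.959963984540054         # Φ⁻¹(0.975)
n_b = (z95 * sigma / 2) ** 2
print(n_b)                      # ≈ 34.57
```

Part (b) gives n ≈ 34.6, so at least 35 households should be sampled.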
Suppose X is non-normally distributed with mean μ and variance σ². In this case we approximate the distribution of the sample mean by a normal distribution, based on the following theorem.
Central Limit Theorem (CLT): This is one of the most important theorems in statistics. When simple random samples of size n are selected from a population, the sampling distribution of the sample mean X̄ can be approximated by a normal probability distribution as the sample size becomes large.
CLT: If random samples of n observations are drawn from a population with any probability distribution with mean μ and standard deviation σ, then, when n is large, the sampling distribution of the sample mean X̄ is approximately normal with mean μ and variance σ²/n:

X̄ ~ N(μ, σ²/n)

Even if we are sampling from a non-normal population, we can use the normal distribution as an approximation for the sample mean if n ≥ 30.
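A quick Monte Carlo sketch of the theorem (illustrative only; the seed and population are arbitrary choices): sample means of n = 36 draws from a uniform population (μ = 0.5, σ² = 1/12) should have a distribution with mean ≈ μ and variance ≈ σ²/n:

```python
import random
random.seed(42)

# population: uniform on [0, 1); μ = 0.5, σ² = 1/12
n, reps = 36, 20000
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

m = sum(means) / reps                        # should be close to μ = 0.5
v = sum((x - m) ** 2 for x in means) / reps  # should be close to σ²/n
print(round(m, 3), round(v, 5), round((1 / 12) / n, 5))
```

A histogram of these 20,000 means would look bell-shaped even though the underlying population is flat, which is exactly what the theorem asserts.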
Example 2: A soft drink vending machine is set so that the amount of drink dispensed is a
random variable with mean of 200 milliliters and standard deviation of 15 milliliters. What is
the probability that the mean amount dispensed in a random sample of size 36 is at least 204
milliliters?
Solution: The standard error of the sample mean is

σ_X̄ = σ/√n = 15/√36 = 2.5

According to the central limit theorem, the sample mean is approximately normally distributed and can be converted to a standard normal:

Z = (X̄ − μ)/SE(X̄) = (204 − 200)/2.5 = 1.6

The probability that the sample mean is at least 204 is P(X̄ ≥ 204) = P(Z ≥ 1.6). From the standard normal Z-table, P(Z > 1.6) = 0.0548. We can therefore conclude that the probability that the sample mean will be greater than 204 is 0.0548.
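The same numbers can be reproduced in Python; here the normal CDF is built from math.erf rather than read from a table:

```python
from math import erf, sqrt

mu, sigma, n = 200, 15, 36
se = sigma / sqrt(n)                           # 15 / 6 = 2.5
z = (204 - mu) / se                            # 1.6
p = 1 - 0.5 * (1 + erf(z / sqrt(2)))           # P(Z > 1.6)
print(round(se, 1), round(z, 1), round(p, 4))  # → 2.5 1.6 0.0548
```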
Exercise: A bulb manufacturer claims that the life of its bulbs is normally distributed with mean 36,000 hours and standard deviation 4,000 hours. A random sample of 16 bulbs had an average life of 34,500 hours. If the manufacturer's claim is correct, what is the probability that the sample mean is smaller than 34,500 hours?
In the preceding section, we studied some of the properties of the sampling distribution of the sample mean. In this section, we consider the sampling distribution of the variance, which is used for inference about a population variance. Consider a random sample of n observations drawn from a population with unknown mean and unknown variance σ². If the sample members are x₁, x₂, . . ., xₙ:
The population variance σ² is defined as

σ² = E[(x − μ)²]

The sample variance S² is defined as

S² = (1/(n − 1)) Σ(Xᵢ − X̄)²

and its square root is termed the sample standard deviation.
Here we use n − 1 in computing the sample variance for a random sample of n observations. This is because, having computed the sample mean, we are left with only n − 1 values that can vary freely (n − 1 degrees of freedom).
Given the above definition of the sample variance, let us determine its mean and distribution.
The mean (expected value) of the sample variance is equal to the population variance:

E(S²) = σ²

Proof: S² = Σ(Xᵢ − X̄)²/(n − 1). Consider the sum of squares (using results from the chapter on expectation):

Σ(Xᵢ − X̄)² = Σ[(Xᵢ − μ) − (X̄ − μ)]²
            = Σ[(Xᵢ − μ)² − 2(X̄ − μ)(Xᵢ − μ) + (X̄ − μ)²]
            = Σ(Xᵢ − μ)² − 2(X̄ − μ)Σ(Xᵢ − μ) + n(X̄ − μ)²
            = Σ(Xᵢ − μ)² − 2n(X̄ − μ)² + n(X̄ − μ)²     [since Σ(Xᵢ − μ) = n(X̄ − μ)]
            = Σ(Xᵢ − μ)² − n(X̄ − μ)²

Taking expectations, E[(Xᵢ − μ)²] = σ² for each i, and E[(X̄ − μ)²] is the variance of the sample mean, that is, σ²/n. Hence we have

E[Σᵢ₌₁ⁿ (Xᵢ − X̄)²] = nσ² − n(σ²/n) = nσ² − σ² = (n − 1)σ²

E(S²) = E[(1/(n − 1)) Σ(Xᵢ − X̄)²]
      = (1/(n − 1)) E[Σ(Xᵢ − X̄)²]
      = (1/(n − 1)) (n − 1)σ² = σ²

⇒ E(S²) = σ²

This implies that the sample variance S² is an unbiased estimator of the population variance σ². This means that, in repeated sampling, the average of all your sample estimates will equal the target parameter σ².
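Unbiasedness can be illustrated by simulation. The sketch below (arbitrary seed and parameters) averages the n − 1 form of S² over many samples from a normal population with σ² = 4; the average settles near 4:

```python
import random
random.seed(1)

mu, sigma, n, reps = 0.0, 2.0, 5, 20000   # population variance σ² = 4
avg_s2 = 0.0
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    avg_s2 += sum((xi - xbar) ** 2 for xi in x) / (n - 1)
avg_s2 /= reps
print(round(avg_s2, 2))                   # close to σ² = 4
```

Replacing the n − 1 divisor with n would make the printed average settle visibly below 4, which is the bias the n − 1 correction removes.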
As we have seen in the preceding topics, identifying the sampling distribution of a sample statistic is essential for making inferences about a population parameter. Therefore, let us identify the distribution of the sample variance.
Consider the distribution of S² under repeated random sampling from a normal distribution. Since a variance cannot be negative, the sampling distribution of the sample variance starts from S² = 0. Its shape is non-symmetric and changes with the sample size and the value of σ². Just as we standardize random variables to form the Z-distribution, the sample variance can be standardized to form a distribution called the chi-square (χ²) distribution. Given a random sample of n observations from a normally distributed population whose variance is σ², with resulting sample variance s², the quantity

(n − 1)S²/σ²

has a chi-square (χ²) distribution with n − 1 degrees of freedom.
[Figure: density curve f(χ²ᵥ) of the chi-square distribution]

When the population mean is not known, a particular sample mean X̄ based on a random sample is used in place of μ, and the deviations are taken from X̄:

χ² = Σ(Xᵢ − X̄)²/σ²

Since S² = Σ(Xᵢ − X̄)²/(n − 1), we have Σ(Xᵢ − X̄)² = (n − 1)S², and therefore

χ²ᵥ = (n − 1)S²/σ²,  with v = n − 1 degrees of freedom.
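The identity Σ(Xᵢ − X̄)² = (n − 1)S² behind this standardization can be checked numerically; the sample values and the population variance σ² below are made up purely for illustration:

```python
import statistics

x = [4.2, 5.1, 3.8, 6.0, 4.9]           # hypothetical sample
n = len(x)
xbar = sum(x) / n
ss = sum((xi - xbar) ** 2 for xi in x)
s2 = statistics.variance(x)              # divisor n - 1
assert abs(ss - (n - 1) * s2) < 1e-9     # Σ(xᵢ - x̄)² = (n-1)S²

sigma2 = 1.0                             # assumed population variance
chi2_stat = (n - 1) * s2 / sigma2        # value of (n-1)S²/σ²
print(round(chi2_stat, 3))               # → 2.9
```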
The chi-square distribution has many important applications. Some of its uses are:
- Tests of independence of attributes
- Tests of goodness of fit
- Tests for the equality of population variances and tests for homogeneity
The calculated value of χ² is compared with the critical value at a particular level of significance and degrees of freedom. If χ²(calculated) > χ²(critical), then the null hypothesis is rejected in favor of the alternative hypothesis.
The chi-square distribution has several important mathematical properties. Some of them are the following.
1. If X₁, X₂, . . ., Xₙ are independent random variables having standard normal distributions, then

   Y = Σᵢ₌₁ⁿ Xᵢ²

   has the chi-square distribution with v = n degrees of freedom.
2. If X₁, X₂, . . ., Xₙ are independent random variables having chi-square distributions with v₁, v₂, . . ., vₙ degrees of freedom, then

   Y = Σᵢ₌₁ⁿ Xᵢ

   has the chi-square distribution with v₁ + v₂ + · · · + vₙ degrees of freedom.
3. The mean and variance of the chi-square distribution are equal to the number of degrees of freedom and twice the number of degrees of freedom, respectively:

   E(χ²ᵥ) = v and Var(χ²ᵥ) = 2v, where v is the degrees of freedom.
That is, applying these properties to (n − 1)S²/σ² ~ χ²(n − 1):

E[(n − 1)S²/σ²] = ((n − 1)/σ²) E(S²), but E(χ²ᵥ) = v = n − 1

⇒ E(S²) = σ²

To obtain the variance of S²:

Var[(n − 1)S²/σ²] = ((n − 1)²/σ⁴) Var(S²) = Var(χ²ᵥ) = 2(n − 1)

⇒ Var(S²) = 2(n − 1)σ⁴/(n − 1)² = 2σ⁴/(n − 1)
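The formula Var(S²) = 2σ⁴/(n − 1) can likewise be checked by simulation (arbitrary seed; σ² = 4 and n = 5, so the theoretical value is 2·16/4 = 8):

```python
import random
random.seed(7)

sigma2, n, reps = 4.0, 5, 50000
s2s = []
for _ in range(reps):
    x = [random.gauss(0, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    s2s.append(sum((xi - xbar) ** 2 for xi in x) / (n - 1))

m = sum(s2s) / reps                       # mean of S², close to σ² = 4
v = sum((s - m) ** 2 for s in s2s) / reps # variance of S², close to 8
print(round(m, 2), round(v, 2), 2 * sigma2 ** 2 / (n - 1))
```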
For many applications involving the population variance, we need values of the cumulative distribution of χ², especially in the upper and lower tails of the distribution. To make inferences about the population variance, the calculated value of χ² is compared with the tabulated value of χ² for the given degrees of freedom. For convenience of interpretation, the value listed in the column headed by a specific tail probability α is denoted χ²(α, v); it is the value of the χ² distribution with v degrees of freedom above which the tail area is α.
For example, with v = 10 and α = 0.05, the table gives χ²(0.05, 10) = 18.31: the area 0.05 is the probability that a χ² value based on a sample of size 11 exceeds 18.31,

P(χ²₁₀ > 18.31) = 0.05

Similarly, for the lower tail, χ²(0.95, 10) = 3.94, so

P(χ²₁₀ < 3.94) = 0.05
Example:
A cement manufacturer claims that concrete prepared from his product has a relatively stable compressive strength, and that the strength, measured in kg per square centimeter, lies within a range of 40 kg/cm². A sample of n = 10 measurements produced a mean and variance of X̄ = 312 and s² = 195. Do these data present sufficient evidence to reject the manufacturer's claim that the population variance is equal to 100? (A range of 40, taken as roughly ±2σ about the mean, corresponds to σ = 10, i.e. σ² = 100.)
Solution: The manufacturer's claim can be rejected if the calculated value of chi-square exceeds the critical value χ²(0.05, 9) = 16.919 from the table.

χ² = (n − 1)s²/σ² = 9(195)/100 = 1755/100 = 17.55

Since the observed chi-square value 17.55 is greater than the critical value 16.919, we reject the manufacturer's claim.
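The arithmetic of this test is easy to replicate; the critical value 16.919 is the table value χ²(0.05, 9) quoted above:

```python
n, s2, sigma2 = 10, 195.0, 100.0
chi2_stat = (n - 1) * s2 / sigma2        # 9 * 195 / 100 = 17.55
crit = 16.919                            # χ²(0.05, 9) from the table
print(chi2_stat, chi2_stat > crit)       # → 17.55 True
```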
One way to compare two population variances, σ₁² and σ₂², is to use the ratio of the sample variances, s₁²/s₂². When independent random samples are drawn from two normal populations with equal variances, s₁²/s₂² has a probability distribution in repeated sampling that is termed the F-distribution.
The F-distribution is the sampling distribution of the ratio of two independent random variables with chi-square distributions, each divided by its respective degrees of freedom. If U and V are independent random variables having chi-square distributions with v₁ and v₂ degrees of freedom, then

F = (U/v₁)/(V/v₂)

is a random variable having the F-distribution, whose value varies with every set of two samples of sizes n₁ and n₂.
Since (n₁ − 1)s₁²/σ₁² ~ χ²(n₁ − 1) and (n₂ − 1)s₂²/σ₂² ~ χ²(n₂ − 1), substituting into the definition gives

F = [((n₁ − 1)s₁²/σ₁²)/(n₁ − 1)] / [((n₂ − 1)s₂²/σ₂²)/(n₂ − 1)] = (s₁²/σ₁²)/(s₂²/σ₂²)

If s₁² and s₂² are the variances of independent random samples of sizes n₁ and n₂ from normal populations with σ₁² = σ₂², then

F(v₁, v₂) = s₁²/s₂², with v₁ = n₁ − 1 and v₂ = n₂ − 1 degrees of freedom.
To test whether the variances of two populations are equal, the calculated value of F is compared with the critical value of F.
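As an illustrative sketch (arbitrary seed and sample sizes), simulating the ratio s₁²/s₂² for samples from two standard normal populations and comparing its average with the known mean of the F-distribution, v₂/(v₂ − 2):

```python
import random
random.seed(3)

n1, n2, reps = 10, 10, 40000
ratios = []
for _ in range(reps):
    a = [random.gauss(0, 1) for _ in range(n1)]
    b = [random.gauss(0, 1) for _ in range(n2)]
    ma, mb = sum(a) / n1, sum(b) / n2
    s1 = sum((x - ma) ** 2 for x in a) / (n1 - 1)   # sample variance 1
    s2 = sum((x - mb) ** 2 for x in b) / (n2 - 1)   # sample variance 2
    ratios.append(s1 / s2)

mean_f = sum(ratios) / reps
# theory: E(F) = v2/(v2-2) with v2 = n2 - 1 = 9, i.e. 9/7 ≈ 1.286
print(round(mean_f, 2), round((n2 - 1) / (n2 - 3), 2))
```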
Example: The research staff of an investment service was interested in determining whether there is a difference in the variance of the maturities of AA-rated industrial bonds and CC-rated industrial bonds.

The student's t-distribution
As shown earlier, the sampling distribution of X̄ has mean μ and variance σ²/n; in other words, (X̄ − μ)/(σ/√n) has a standard normal distribution if the sampled population is normal or if the sample size is large. However, if the sample size is small, making inferences about the population mean with the Z-distribution as the test statistic involves two types of problems.
1. The shape of the sampling distribution of X̄ and/or the Z statistic depends on the shape of the sampled population. We can no longer assume that the distribution of X̄ is approximately normal, because the central limit theorem ensures normality only for samples that are sufficiently large. However, the sampling distribution of the sample mean is normal if the sampled population is normal.
2. The population standard deviation is almost always unknown. Even though it is possible to estimate the population standard deviation with the sample standard deviation s, s is a poor approximation of the population standard deviation when the sample size is small.
In the case where the population standard deviation is unknown, the standard normal statistic cannot be used. It is natural to replace the unknown σ by the sample standard deviation s. This gives a distribution called the student's t-distribution, after Gosset, who developed the probability distribution of the statistic

t = (X̄ − μ)/(s/√n)
Given a random sample of n observations, with mean X̄ and standard deviation s, from a normally distributed population with mean μ, the random variable t follows the student's t-distribution with (n − 1) degrees of freedom. The shape of the student's t-distribution is rather similar to that of the standard normal distribution. Both distributions have mean zero, and the probability density functions of both are symmetric about their mean. However, the density function of the student's t-distribution has a larger dispersion (variability) than the standard normal distribution. The actual amount of variability in the sampling distribution of t depends on the sample size n.
As the number of degrees of freedom increases (i.e. as the sample size increases), the student's t-distribution becomes increasingly similar to the standard normal distribution. This is intuitively reasonable and follows from the fact that, for a large sample size, the sample standard deviation is a very precise estimator of the population standard deviation. In particular, the smaller the degrees of freedom associated with the t-statistic, the more variable its sampling distribution will be.
If xᵢ (i = 1, . . ., n) are sample values drawn from a normal population with mean μ and variance σ², a standard normal random variable can be defined for each observation as

Zᵢ = (xᵢ − μ)/σ

which follows a normal distribution with mean 0 and variance 1. For the same n sample values, the sum of the squares of these standard normal variables, ΣZᵢ², has a chi-square distribution.
A sample statistic t can also be defined as the ratio of a standard normal variable Z to the square root of a chi-square variable Y divided by its degrees of freedom:

t = Z/√(Y/(n − 1)), where Z = (X̄ − μ)/(σ/√n) and Y = (n − 1)S²/σ²

Substituting these expressions,

t = [(X̄ − μ)/(σ/√n)] / √[((n − 1)S²/σ²)/(n − 1)]
  = [(X̄ − μ)/(σ/√n)] · (σ/S)
  = (X̄ − μ)/(S/√n)

so the unknown σ cancels, leaving a statistic that depends only on the sample.
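The cancellation of σ can be verified numerically. The sample and the population values μ and σ below are hypothetical; the point is only that the two forms of t agree:

```python
from math import sqrt

# hypothetical sample; assumed population values μ = 5, σ = 2 (σ cancels)
x = [4.1, 5.6, 6.2, 3.9, 5.0, 4.8]
mu, sigma = 5.0, 2.0
n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

z = (xbar - mu) / (sigma / sqrt(n))            # standard normal part
y = (n - 1) * s2 / sigma ** 2                  # chi-square part
t_ratio = z / sqrt(y / (n - 1))                # t = Z / sqrt(Y/(n-1))
t_direct = (xbar - mu) / (sqrt(s2) / sqrt(n))  # t = (X̄-μ)/(S/√n)
assert abs(t_ratio - t_direct) < 1e-12         # the two forms agree
print(round(t_direct, 4))
```

Changing the assumed σ leaves both t values unchanged, which is exactly why the t-statistic is usable when σ is unknown.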
In order to base inferences about the population mean on the student's t-distribution, critical values are tabulated for different degrees of freedom. t(α, v) denotes the value for which the area to the right under the curve of the t-distribution with v degrees of freedom is equal to α. That is, if t is a random variable having the t-distribution with v degrees of freedom, then P(t > t(α, v)) = α. Since the density function is symmetric, t(1 − α, v) = −t(α, v).
The tabulated t values are denoted by t(α). The area under the t-distribution curve above t(α) is α, and the area below it is 1 − α.
[Figure: tabulated t(α) and t(1 − α) = −t(α) values on the t-distribution curve]
The areas under the t-distribution curve can be interpreted in terms of probabilities by taking, say, a t-distribution based on a sample of size n = 15. Thus for v = n − 1 = 14, the t value above which the area under the t-curve is α = 0.05 is t(0.05) = 1.76. It means that the probability of a t value computed for a random sample of size n = 15 being greater than t(0.05) = 1.76 is α = 0.05, and may be stated as

P(t > t(0.05)) = P(t > 1.76) = 0.05

Similarly, the t value below which the area under the t-distribution curve is α = 0.05 is t(0.95) = −t(0.05) = −1.76. It means that the probability of a t value based on a random sample of size n = 15 being less than −t(0.05) = −1.76 is α = 0.05. It is stated as P(t < −t(0.05)) = P(t < −1.76) = 0.05.
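This tail probability can be checked by simulation (illustrative sketch with an arbitrary seed): generate many samples of size 15 from a standard normal population, compute t for each, and count how often t exceeds 1.76:

```python
import random
from math import sqrt

random.seed(5)
n, reps = 15, 50000
hits = 0
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]
    xbar = sum(x) / n
    s = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    t = xbar / (s / sqrt(n))        # μ = 0 for this population
    if t > 1.76:
        hits += 1
print(round(hits / reps, 3))        # close to 0.05, the tabled tail area
```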
[Figure: t-distribution with v = 14; tail areas of α = 0.05 below −1.76 and above 1.76]