Statistical Methods For Research - Additional Notes On Sampling Tests (2022)


ADDITIONAL NOTES ON SAMPLING DISTRIBUTIONS

1.3 Sampling Distributions for Proportion

Example 3.2: Binomial Distribution


The population of human beings can be classified as “having type O blood” or “not
having type O blood”. There is no way that we can get exact information about the entire
population, since this group is so large. It has been estimated that the proportion of
people with type O blood is 0.40. Assume that the estimate is correct. If we observe a
single person selected at random, the probability that the person will have type O blood is
0.40 and the probability that the person will not have type O blood is 0.60.
Let us imagine that a large metropolitan hospital has a list of several thousand people
willing to donate blood.

If 4 people are chosen at random from the list, how likely is it that:
- none has type O blood?
- exactly one has type O?
- exactly two have type O?
- exactly three? all four?

Solution:
First, list the different possible outcomes for a sample of 4 people. Let O mean that a
person has type O blood, and let N mean that the person does not have type O blood.
The sequence of symbols gives the results in the order in which they occur in the
experiment, so NNON is a different outcome from ONNN.

Table 3.2: Possible outcomes, grouped by the number of people with type O blood

Number with type O blood, 𝑛(𝑂)    Possible outcomes
0                                 NNNN
1                                 ONNN, NONN, NNON, NNNO
2                                 OONN, ONON, ONNO, NONO, NNOO, NOON
3                                 OOON, OONO, ONOO, NOOO
4                                 OOOO

Here, 𝑃(𝑁) = 0.6 and 𝑃(𝑂) = 0.4.

The probability of zero out of four having type O blood is

$$P[n(O) = 0] = P(NNNN) = [P(N)]^4 = 0.6^4 = 0.1296$$

The probability that 1 out of 4 will have type O blood is

$$P[n(O) = 1] = P(ONNN) + P(NONN) + P(NNON) + P(NNNO) = 4 \times [P(O)] \times [P(N)]^3 = 4 \times 0.4 \times 0.6^3 = 0.3456$$

The probability that 2 out of 4 will have type O blood is

$$P[n(O) = 2] = P(OONN) + P(ONON) + P(ONNO) + P(NOON) + P(NONO) + P(NNOO) = 6 \times [P(O)]^2 \times [P(N)]^2 = 6 \times 0.4^2 \times 0.6^2 = 0.3456$$

The probability that 3 out of 4 will have type O blood is

$$P[n(O) = 3] = P(OOON) + P(OONO) + P(ONOO) + P(NOOO) = 4 \times [P(O)]^3 \times [P(N)] = 4 \times 0.4^3 \times 0.6 = 0.1536$$

Finally, the probability of 4 out of 4 having type O blood is

$$P[n(O) = 4] = P(OOOO) = [P(O)]^4 = 0.4^4 = 0.0256$$
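
The five probabilities of Example 3.2 can also be checked numerically. The following is a minimal Python sketch, not part of the original notes; the helper name `binomial_pmf` is an illustrative assumption. It counts the outcomes with the binomial coefficient instead of listing them by hand.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    # comb(n, k) counts the orderings, e.g. 4 orderings with one O among 4 people
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 4, 0.4  # four people, P(O) = 0.4 as in Example 3.2
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"P[n(O) = {k}] = {prob:.4f}")
print(f"total = {sum(pmf):.4f}")
```

The printed values should agree with the hand calculation above, and they add up to 1, as the probabilities of a complete distribution must.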

In summary, the probability distribution for Example 3.2 is shown in Figure 3.3.
[Plot: probability p(x) on the vertical axis against x = 0, 1, 2, 3, 4 on the horizontal axis]

Figure 3.3: Probability mass function of the binomial distribution with 𝑛 = 4 and 𝑃(𝑂) = 0.4

Figure 3.3 shows the probability mass function of the binomial random variable 𝑁(𝑂)
with 𝑛 = 4 trials and probability of success 𝑝 = 0.4, i.e. 𝑁(𝑂) ~ 𝐵𝑖𝑛(𝑛 = 4, 𝑃(𝑂) = 0.4).
This discrete random variable takes the values 0, 1, 2, 3, 4 and represents the number of
people with type O blood in a random sample of 4 people.
𝑃[𝑁(𝑂) = 𝑛(𝑂)] is the probability function of 𝑁(𝑂) (the binomial probability distribution).

Note that a binomial probability distribution is a model of an experiment with only two
possible outcomes. We concentrate on one of the outcomes, type O blood (for instance),
and count the number of occurrences (successes) in the sample. The probability of type O
blood does not change from observation to observation, and the observations are
independent of each other.
Example 3.3:
A large drug company has 100 potential new prescription drugs under clinical test. About
20% of all drugs that reach this stage are eventually licensed for sale. What is the
probability that at least 15 of the 100 drugs are eventually licensed? Assume that the
binomial assumptions are satisfied, and use a normal approximation with continuity
corrections.

Solution:
Let 𝑋 be the number of drugs that are licensed for sale, so that
𝑋 ~ 𝐵𝑖𝑛(𝑛 = 100, 𝑝 = 0.2).
Then 𝜇 = 𝑛𝑝 = 100 × 0.2 = 20 and 𝜎² = 𝑛𝑝(1 − 𝑝) = 100 × 0.2 × 0.8 = 16, so 𝜎 = 4.
We may approximate the distribution of 𝑋 by a normal distribution, i.e. 𝑋 ≈ 𝑁(𝜇 = 20, 𝜎² = 16).

The desired probability is that 15 or more drugs are approved. Because 𝑥 = 15 is
included, the continuity correction is to take the event as 𝑋 greater than or equal to 14.5,
i.e.,

$$P(X \ge 14.5) = 1 - P(X < 14.5) = 1 - \Phi\left(\frac{14.5 - 20}{4}\right) = 1 - \Phi(-1.38) = 1 - [1 - \Phi(1.38)] = \Phi(1.38) = 0.9162$$
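
As a rough check on Example 3.3, the sketch below (an illustration, not part of the original notes) compares the continuity-corrected normal approximation with the exact binomial tail. The helper `phi` is an assumed name for the standard normal CDF, built from `math.erf`. Because the code uses the unrounded value z = −1.375 rather than the table value 1.38, the approximation comes out near 0.915 instead of 0.9162.

```python
from math import comb, erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p = 100, 0.2
mu = n * p                     # mean of the binomial, 20
sigma = sqrt(n * p * (1 - p))  # standard deviation of the binomial, 4

# Normal approximation with continuity correction: P(X >= 15) ~ P(X >= 14.5)
approx = 1.0 - phi((14.5 - mu) / sigma)

# Exact binomial tail probability P(X >= 15)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(15, n + 1))

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```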
1.4 Additional notes to the sampling distribution

For 𝑖 = 1, …, 𝑛, let 𝑋𝑖 be normally distributed with mean 𝜇 and variance 𝜎². This is
denoted

$$X_i \sim N(\mu, \sigma^2)$$

The normal density of 𝑋𝑖 is defined as

$$f_{X_i}(x_i; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right], \quad -\infty < x_i < \infty$$

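The average of the 𝑋𝑖 is

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

and, for independent 𝑋𝑖, its mean and variance are

$$E(\bar{X}) = \mu \quad \text{and} \quad Var(\bar{X}) = \frac{\sigma^2}{n}$$

Hence 𝑋̅ is normally distributed,

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$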

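If 𝑋1, 𝑋2, …, 𝑋𝑛 are independent, we can form the likelihood as follows:

$$
\begin{aligned}
L(\mu, \sigma^2 \mid \boldsymbol{x}) &= \prod_{i=1}^{n} f_{X_i}(x_i; \mu, \sigma^2) = \prod_{i=1}^{n} \left\{ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right] \right\} \\
&= (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2\right]
\end{aligned}
$$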

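Thus the log-likelihood 𝑙(𝜇, 𝜎²|𝒙) is derived as follows:

$$
\begin{aligned}
l(\mu, \sigma^2 \mid \boldsymbol{x}) &= \ln\left\{(2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2\right]\right\} \\
&= \ln\left[(2\pi\sigma^2)^{-n/2}\right] + \ln\left\{\exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2\right]\right\} \\
&= -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2}\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2
\end{aligned}
$$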

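To estimate the parameters 𝜇 and 𝜎², we use the maximum likelihood method of
estimation, that is, we set the partial derivatives of the log-likelihood to zero:

$$\frac{\partial}{\partial\mu}\, l(\mu, \sigma^2 \mid \boldsymbol{x}) = -\frac{1}{2}\sum_{i=1}^{n} 2\left(\frac{x_i - \mu}{\sigma}\right)\left(-\frac{1}{\sigma}\right) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$$

and

$$\frac{\partial}{\partial\sigma^2}\, l(\mu, \sigma^2 \mid \boldsymbol{x}) = 0 - \frac{n}{2}\left(\frac{1}{\sigma^2}\right) + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$$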

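Hence, solving the first equation, we obtain the estimate of 𝜇 (usually written 𝜇̂):

$$\hat{\mu} = \bar{x}$$

To check whether the estimator 𝜇̂ = 𝑋̅ is unbiased, take its expectation:

$$E(\bar{X}) = E\left(\frac{\sum_{i=1}^{n} X_i}{n}\right) = \frac{\sum_{i=1}^{n} E(X_i)}{n} = \frac{\sum_{i=1}^{n} \mu}{n} = \mu$$

Thus, 𝑋̅ is an unbiased estimator of 𝜇.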

To estimate the variance 𝜎², we substitute 𝜇̂ = 𝑥̅ into the log-likelihood function. The
estimate of the variance is then obtained from

$$\frac{\partial}{\partial\sigma^2}\, l(\mu, \sigma^2 \mid \boldsymbol{x}) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 0$$

This yields

$$\check{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
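
The two likelihood equations can be verified numerically. The sketch below is an illustration only: the data values are hypothetical, and the function names `normal_mle` and `score` are assumptions introduced here. It computes the maximum likelihood estimates and then evaluates both partial derivatives of the log-likelihood at those estimates, where they should vanish up to rounding error.

```python
def normal_mle(xs):
    """Maximum likelihood estimates (mu_hat, var_hat) for a normal sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # divides by n, not n - 1
    return mu_hat, var_hat

def score(xs, mu, var):
    """Partial derivatives of the normal log-likelihood w.r.t. mu and sigma^2."""
    n = len(xs)
    d_mu = sum(x - mu for x in xs) / var
    d_var = -n / (2 * var) + sum((x - mu) ** 2 for x in xs) / (2 * var**2)
    return d_mu, d_var

data = [4.1, 5.3, 6.0, 4.8, 5.6]  # hypothetical observations
mu_hat, var_hat = normal_mle(data)
d_mu, d_var = score(data, mu_hat, var_hat)

print(f"mu_hat = {mu_hat:.4f}, var_hat = {var_hat:.4f}")
print(f"derivatives at the estimates: {d_mu:.2e}, {d_var:.2e}")
```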

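Next we assess whether 𝜎̌² is an unbiased estimator of 𝜎²:

$$
\begin{aligned}
E(\check{\sigma}^2) &= E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\right] = \frac{1}{n} E\left\{\sum_{i=1}^{n}\left[X_i^2 + 2(X_i)(-\bar{X}) + \bar{X}^2\right]\right\} \\
&= \frac{1}{n}\left[\sum_{i=1}^{n} E(X_i^2) - nE(\bar{X}^2)\right] = \frac{1}{n}\left[n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right)\right] \\
&= \left(\frac{n-1}{n}\right)\sigma^2
\end{aligned}
$$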
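Since 𝐸(𝜎̌²) ≠ 𝜎², the estimator $\check{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is a biased estimator of 𝜎². We may
define 𝑠² = 𝑎𝜎̌² and choose the constant 𝑎 so that 𝑠² is an unbiased estimator of 𝜎²:

$$E(s^2) = E(a\check{\sigma}^2) = a\left(\frac{n-1}{n}\right)\sigma^2 = \sigma^2$$

Then,

$$a = \frac{n}{n-1}$$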

Thus, the unbiased estimator of the variance 𝜎² is

$$s^2 = a\check{\sigma}^2 = \frac{n}{n-1} \times \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
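
The bias factor (n − 1)/n can be confirmed by exact enumeration. The sketch below is an illustration under an assumption that differs from the notes: instead of a normal population it uses a hypothetical two-point population (values 0 and 1, each with probability 1/2), because the expectation argument above relies only on the mean and variance of the 𝑋𝑖 and all equally likely samples can then be listed exactly. Exact fractions avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: values 0 and 1, each with probability 1/2
values = [Fraction(0), Fraction(1)]
n = 3  # sample size

pop_mean = sum(values) / len(values)
pop_var = sum((v - pop_mean) ** 2 for v in values) / len(values)  # sigma^2 = 1/4

# All 2^3 equally likely samples of size n (independent draws)
samples = list(product(values, repeat=n))

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)

# Exact expectations of the two estimators over all samples
mean_mle = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
mean_s2 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print("sigma^2           =", pop_var)
print("E(divide by n)    =", mean_mle)  # equals (n-1)/n * sigma^2
print("E(divide by n-1)  =", mean_s2)   # equals sigma^2
```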
1.5 Additional rules of the normal distribution

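#1 If $Z \sim N(0,1)$ then $Z^2 \sim \chi^2_1$

#2 If $Z \sim N(0,1)$ and $A \sim \chi^2_a$ are independent, then $\dfrac{Z}{\sqrt{A/a}} \sim t_a$

#3 If $A \sim \chi^2_a$ and $B \sim \chi^2_b$ are independent, then $A + B \sim \chi^2_{a+b}$

#4 If $A \sim \chi^2_a$ and $B \sim \chi^2_b$ are independent, then $\dfrac{A/a}{B/b} \sim F_{a,b}$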

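The sum of squares $\sum_{i=1}^{n}(x_i - \bar{x})^2$ can be expanded as below, using $\sum_{i=1}^{n}(x_i - \mu) = n(\bar{x} - \mu)$ in the last step:

$$
\begin{aligned}
\sum_{i=1}^{n}(x_i - \bar{x})^2 &= \sum_{i=1}^{n}(x_i - \mu + \mu - \bar{x})^2 = \sum_{i=1}^{n}\left[(x_i - \mu) - (\bar{x} - \mu)\right]^2 \\
&= \sum_{i=1}^{n}(x_i - \mu)^2 - 2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \mu) + \sum_{i=1}^{n}(\bar{x} - \mu)^2 \\
&= \sum_{i=1}^{n}(x_i - \mu)^2 - n(\bar{x} - \mu)^2
\end{aligned}
$$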
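
The identity holds for any numbers 𝑥𝑖 and any value of 𝜇, which makes it easy to spot-check. A small Python sketch with hypothetical data (the values of `xs` and `mu` are arbitrary illustrations, not from the notes):

```python
xs = [2.0, 3.5, 5.0, 7.5]  # hypothetical observations
mu = 4.0                   # hypothetical population mean

n = len(xs)
xbar = sum(xs) / n

# Left-hand side: squared deviations from the sample mean
lhs = sum((x - xbar) ** 2 for x in xs)

# Right-hand side: squared deviations from mu, minus n * (xbar - mu)^2
rhs = sum((x - mu) ** 2 for x in xs) - n * (xbar - mu) ** 2

print(lhs, rhs)
```

Both sides evaluate to the same number, whatever value is chosen for `mu`.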

Dividing $\sum_{i=1}^{n}(x_i - \bar{x})^2$ by 𝜎² yields

$$\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2 - \frac{(\bar{x} - \mu)^2}{\sigma^2/n}$$

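According to #1,

$$X_i \sim N(\mu, \sigma^2) \;\rightarrow\; \frac{X_i - \mu}{\sigma} \sim N(0,1) \;\rightarrow\; \left(\frac{X_i - \mu}{\sigma}\right)^2 \sim \chi^2_1$$

Also,

$$\bar{X} \sim N(\mu, \sigma^2/n) \;\rightarrow\; \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1) \;\rightarrow\; \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2 \sim \chi^2_1$$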
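According to #3, since the 𝑋𝑖 are independent,

$$\sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^2 = \underbrace{\left(\frac{X_1 - \mu}{\sigma}\right)^2}_{\chi^2_1} + \underbrace{\left(\frac{X_2 - \mu}{\sigma}\right)^2}_{\chi^2_1} + \cdots + \underbrace{\left(\frac{X_n - \mu}{\sigma}\right)^2}_{\chi^2_1}$$

Thus,

$$\sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^2 \sim \chi^2_n$$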

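In the decomposition above, the first term on the right-hand side has a chi-square
distribution with 𝑛 degrees of freedom, and the second term has a chi-square
distribution with 1 degree of freedom:

$$\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \underbrace{\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2}_{\chi^2_n} - \underbrace{\frac{(\bar{x} - \mu)^2}{\sigma^2/n}}_{\chi^2_1}$$

Moving the second term to the left-hand side and applying #3 (for a normal sample, 𝑋̅
and $\sum_{i=1}^{n}(X_i - \bar{X})^2$ are independent), the degrees of freedom must add up to 𝑛, so

$$\therefore\; \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})^2 \sim \chi^2_{n-1}$$

These statistics are useful in Analysis of Variance (ANOVA).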

1.6 Estimation: Point estimation, interval estimation

Inferences about Population Central Values

Statistics
Objective: to make inferences about a population based on information contained in a
sample.

Typical population parameters, with their usual sample estimates:

- mean 𝜇, estimated by 𝑥̅
- median 𝑀, estimated by 𝑥̃
- variance 𝜎², estimated by 𝑠²
- a proportion 𝜋

Populations are characterized by numerical descriptive measures called parameters.

Methods for making inferences about parameters


- Estimate the value of the population parameter of interest.
- Test a hypothesis about the parameter

These two methods of statistical inference (estimation and hypothesis testing):

- Involve different procedures
- Answer two different questions about the parameters, i.e.
1. What is the value of the population parameter?
2. Is the parameter value equal to a specified value?
Estimation of μ
Compute a single value (statistic) from the sample data to estimate a population
parameter.
- The sample mean 𝑥̅ is a point estimate of μ.
- From the central limit theorem and the empirical rule, we know that the interval
μ ± 2σ𝑥̅ (more precisely, μ ± 1.96σ𝑥̅), where σ𝑥̅ = σ/√n is the standard error of 𝑥̅,
includes 95% of the 𝑥̅ values in repeated sampling.

[Plot: normal density curve of 𝑋̅; horizontal axis from 67 to 73]

Figure 3.4: Normal distribution of 𝑋̅ with mean 𝜇 and variance $\sigma_{\bar{X}}^2 = \sigma^2/n$

This means that any time 𝑥̅ falls in the interval 𝜇 ± 1.96𝜎𝑥̅, the interval 𝑥̅ ± 1.96𝜎𝑥̅ will
contain the parameter 𝜇.
The probability of 𝑥̅ falling in the interval 𝜇 ± 1.96𝜎𝑥̅ is 0.95.
Thus 𝑥̅ ± 1.96𝜎𝑥̅ is an interval estimate of 𝜇 with level of confidence 0.95: in repeated
sampling, 95% of the intervals calculated using the formula 𝑥̅ ± 1.96𝜎𝑥̅ will contain the
mean 𝜇.
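
The repeated-sampling interpretation can be illustrated by simulation. The sketch below is not part of the original notes; the population values 𝜇 = 70 and 𝜎 = 3, the sample size, the number of repetitions and the random seed are all arbitrary assumptions. It draws many normal samples, builds 𝑥̅ ± 1.96𝜎𝑥̅ for each, and reports the fraction of intervals that contain 𝜇, which should come out close to 0.95.

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the run is reproducible

mu, sigma, n = 70.0, 3.0, 25  # hypothetical population and sample size
reps = 5000                   # number of repeated samples

half_width = 1.96 * sigma / sqrt(n)  # 1.96 * sigma_xbar, with sigma known

covered = 0
for _ in range(reps):
    # Sample mean of n independent normal observations
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    # Does the interval xbar +/- 1.96 * sigma_xbar contain mu?
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

coverage = covered / reps
print(f"fraction of intervals containing mu: {coverage:.3f}")
```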
[Plot: density curves of 𝑋̅ for n = 5, n = 10 and n = 25; horizontal axis from −3 to 3, with marks ±r1, ±r2, ±r3]

Figure 3.5: Normal distributions of 𝑋̅ for various values of the sample size 𝑛

Example 3.4:
The percentage of calories from fat has the following sample summary:
𝑥̅ = 36.92, 𝑠 = 6.73 and 𝑛 = 168.

Compute a 95% confidence interval (CI) for the mean 𝜇.

Solution:
The 95% CI for 𝜇 based on 𝑥̅ (which we may denote 𝐶𝐼95%(𝑥̅)) is

$$CI_{95\%}(\bar{x}) \equiv \bar{x} \pm z_{2.5\%} \times \frac{s}{\sqrt{n}} = 36.92 \pm 1.96 \times \frac{6.73}{\sqrt{168}} = (35.90, 37.94)$$

We are 95% confident that the average percentage of calories from fat is a value
between 35.90 and 37.94.
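
The arithmetic of Example 3.4 can be reproduced with a few lines of Python. This is a minimal sketch; the function name `z_interval` and its default z = 1.96 are illustrative choices, not part of the notes.

```python
from math import sqrt

def z_interval(xbar, sd, n, z=1.96):
    """Large-sample confidence interval xbar +/- z * sd / sqrt(n)."""
    half_width = z * sd / sqrt(n)
    return xbar - half_width, xbar + half_width

# Example 3.4: percentage of calories from fat
lower, upper = z_interval(xbar=36.92, sd=6.73, n=168)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```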

For a specified value of (1 − 𝛼), a 100(1 − 𝛼)% CI for 𝜇 (provided that 𝜎² is known) is

$$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}, \quad \text{where } \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

The width of the CI for 𝜇 increases with the confidence coefficient we choose:

𝑥̅ ± 2.58𝜎𝑥̅ forms a 99% CI for 𝜇
𝑥̅ ± 2.33𝜎𝑥̅ forms a 98% CI for 𝜇
𝑥̅ ± 1.96𝜎𝑥̅ forms a 95% CI for 𝜇
𝑥̅ ± 1.645𝜎𝑥̅ forms a 90% CI for 𝜇
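
The multipliers in this list are the upper 𝛼/2 points of the standard normal distribution. As a sketch (not part of the original notes), they can be recovered with `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the dictionary name `z_values` is an illustrative assumption. The unrounded values show that 2.58 and 2.33 are two-decimal roundings.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

z_values = {}
for level in (0.99, 0.98, 0.95, 0.90):
    alpha = 1 - level
    # z_{alpha/2} leaves probability alpha/2 in the upper tail
    z_values[level] = std_normal.inv_cdf(1 - alpha / 2)

for level, z in z_values.items():
    print(f"{level:.0%} CI: z = {z:.3f}")
```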
Example 3.5:
A forester wishes to estimate the average number of “count trees” per acre (trees larger
than a specified size) on a 2,000-acre plantation. A random sample of 𝑛 = 50 1-acre plots
is selected and examined. The average (mean) number of count trees per acre is found to
be 27.3, with a standard deviation of 12.1.
Construct a 99% CI for 𝜇, the mean number of count trees per acre for the entire
plantation.

Solution:
The 99% confidence interval for 𝜇 is

$$\bar{x} \pm 2.58\,\sigma_{\bar{x}} = 27.3 \pm 2.58\left(\frac{12.1}{\sqrt{50}}\right) = 27.3 \pm 4.41$$

We are 99% confident that the average number of count trees per acre is between 22.89
and 31.71.

Choosing the sample size for estimating 𝜇

Points to consider:
1. The tolerable error, which establishes the desired width of the interval.
2. The level of confidence.
3. If the CI for 𝜇 is too wide, the estimate of 𝜇 will be imprecise and not informative.
4. With a very low level of confidence, the CI is very likely to be in error, i.e. to fail to
contain 𝜇.
5. Obtaining a CI with both a narrow width and a high level of confidence requires a
large sample size 𝑛, which costs time and money.

Suppose we want to estimate 𝜇 using a 100(1 − 𝛼)% CI with tolerable error 𝑊.

- Interval: 𝑥̅ ± 𝐸, where 𝐸 = 𝑊/2 and 𝑊 is the tolerable error (total width) of the CI
- To determine the sample size 𝑛, set

$$E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

then,

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$$

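Confidence interval: $CI_{100(1-\alpha)\%} = (\bar{x} - z_{\alpha/2}\,\sigma_{\bar{x}},\; \bar{x} + z_{\alpha/2}\,\sigma_{\bar{x}})$

To meet the requirement, the tolerable error 𝑊 must be greater than or equal to the
width of the interval, i.e.,

$$W \ge (\bar{x} + z_{\alpha/2}\,\sigma_{\bar{x}}) - (\bar{x} - z_{\alpha/2}\,\sigma_{\bar{x}}) = 2z_{\alpha/2}\,\sigma_{\bar{x}} = 2z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$

so that, with 𝜎 replaced by an estimate 𝑠,

$$n \ge \left(\frac{2z_{\alpha/2}\, s}{W}\right)^2$$

This formula gives the sample size required for a 100(1 − 𝛼)% CI for 𝜇 of the form 𝑥̅ ± 𝐸.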

Example 3.6:
In the dietary intake example, the researchers wanted to estimate the mean percentage of
calories from fat (PCF) with a 95% CI having a tolerable error of 3. From previous
studies, the values of PCF ranged from 10% to 50%. How many observations must the
researchers include in the sample to achieve their specification? (Case study: percentage of
calories from fat; Ott pg 193)

Solution:
Here, we have 𝑥𝑚𝑖𝑛 = 10% and 𝑥𝑚𝑎𝑥 = 50%. Hence, our estimate of the standard deviation
can be formed as follows:

$$s = \frac{x_{max} - x_{min}}{4} = \frac{50 - 10}{4} = 10$$

We want a 95% CI with tolerable error 𝑊 = 3. Setting 𝛼 = 5% gives
𝑧2.5% = 1.96. Then,

$$n \ge \left(\frac{2z_{\alpha/2}\, s}{W}\right)^2 = \left(2 \times 1.96 \times \frac{10}{3}\right)^2 = 170.74$$

So, a random sample of 171 observations should give a 95% CI for 𝜇 with the desired width
of 3, provided 10 is a reasonable estimate of 𝜎.
Example 3.7:
A federal agency has decided to investigate the advertised weight printed on cartons on a
certain brand of cereal. The company in question periodically samples cartons of cereal
coming off the production line to check their weight. A summary of 1,500 of the weights
made available to the agency indicates a mean weight of 11.80 ounces per carton and a
standard deviation of 0.7 ounce. Find the number of cereal cartons the federal agency
must examine to estimate the average weight of cartons being produced now, using a
99% CI of width of 0.50.

Solution:
The federal agency has specified that 𝑊 = 0.5.
Assuming that the weights made available to the agency by the company are accurate, we
have 𝜎 = 0.7. For a 99% CI, we set 𝛼 = 1% and determine 𝑧𝛼/2 = 2.58. Thus, the
required sample size is

$$n \ge \left(\frac{2z_{\alpha/2}\, s}{W}\right)^2 = \left(2 \times 2.58 \times \frac{0.7}{0.5}\right)^2 = 52.2$$

So, the federal agency must obtain a random sample of 53 cereal cartons to estimate the
mean weight to within ±0.25 ounce.
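
The sample-size rule used in Examples 3.6 and 3.7 can be wrapped in a small helper. This is a sketch only; the function name `sample_size` is an assumption. It takes the total width 𝑊 and rounds up, since 𝑛 must be a whole number at least as large as the bound.

```python
from math import ceil

def sample_size(width, sd, z):
    """Smallest integer n with n >= (2 * z * sd / width)^2."""
    return ceil((2 * z * sd / width) ** 2)

# Example 3.6: 95% CI, W = 3, s = 10
n_pcf = sample_size(width=3, sd=10, z=1.96)

# Example 3.7: 99% CI, W = 0.5, sigma = 0.7
n_cereal = sample_size(width=0.5, sd=0.7, z=2.58)

print(n_pcf, n_cereal)
```

Halving the width 𝑊 multiplies the required sample size by about four, because 𝑊 enters the formula squared.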
