Biostat 621 Extra Help Session Oct 17 2018 - Annotated
621
Extra Help Session
October 17, 2018
1) Distinguish between:
Sample statistics are descriptors of the values of the observations in a single sample
distribution:
e.g., x̄ is the sample mean (measure of central tendency) and s is the sample
standard deviation (spread) calculated from the observed sample.
Based on the data from our single sample, x̄ is the best estimate of μ and
s is the best estimate of σ.
The population distribution describes the probability associated with possible values
of a random variable X (such as birthweight) in an entire population of N
observations (which we cannot observe).
The population standard deviation describes the “spread” in the population values.
c) The sample distribution (i.e. the distribution of the random variable X in a given
sample from a study)
The sample distribution describes the probability associated with possible values
of a random variable X (such as birthweight) in the observed sample of n
observations (which we can observe).
The sample standard deviation describes the “spread” in the sample values (of our
observed sample).
Biostatistics 140.621 Extra Help Session- Annotated
The sampling distribution of the sample mean, x̄, describes the probability associated
with possible values of the random variable x̄ (such as mean birthweight) across the
theoretically possible samples of n observations. We cannot observe this; it is a
theoretical distribution in which the mean of the x̄'s is the true population mean (μ)
and the standard deviation of the x̄'s (also known as the standard error of the sample
mean, or “standard error of the mean”) is σ/√n.
The standard error of the mean (the standard deviation of the x̄'s) describes the “spread”
in the sample means that are theoretically possible when drawing all
possible samples of size n.
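The claim that the sample means center on μ with standard deviation σ/√n can be checked with a small simulation. The sketch below is illustrative only: it assumes a hypothetical normal population with μ = 25 and σ = 6 (values chosen for illustration, not data from the study), and the names `mu`, `sigma`, `n`, and `n_samples` are assumptions.

```python
import math
import random
import statistics

# Hypothetical population parameters, chosen only for illustration.
mu, sigma = 25.0, 6.0
n = 25             # size of each sample
n_samples = 5000   # number of repeated samples drawn

rng = random.Random(621)  # fixed seed so the run is reproducible

# Draw many samples of size n and record each sample mean.
sample_means = [
    statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = sigma / math.sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {mu})")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {theoretical_se:.3f})")
```

In practice only one sample is observed, so this repeated sampling is possible only in a simulation where the population is known.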
3) If the sample size is approximately 25 or larger, the Central Limit Theorem says that:
d) The distribution of sample variances (from all possible samples of the given sample
size) is approximately normal. The sample distribution can be approximated by a t-
distribution.
The following are data from a study to investigate maternal postpartum depression (within one
year after childbirth) as a risk factor for family homelessness at 3 years of follow-up (American J
Public Health 2014; 104:1664).
Table 1 below shows the baseline clinical and socio-demographic characteristics of the study
sample by postpartum depression status as well as two 3-year follow-up outcomes: homeless and
risk of homelessness.
                               Women with Depression During the Postpartum Year
                               Yes           No            Total
Baseline Characteristics*
  n (sample size)              375           2,599         2,974
  Age (years), mean (SD)       24.4 (5.8)    25.1 (6.1)    25.0 (6.0)
3-Year Outcomes*
  Homeless (%)                 6             2             3
  At risk of homelessness (%)  14            9             10
Note: n is the sample size in a group, mean = sample mean = x̄, and SD = s (sample standard
deviation).
3) Assuming a normal distribution for age, what proportion in the group of women with
depression in the postpartum year would be expected to be age 34 years or older?
a) 0
b) 0.05
Let X = age in years. Then,
$$P(X \ge 34) = P\left(\frac{X - \hat{\mu}}{\hat{\sigma}} \ge \frac{34 - 24.4}{5.8}\right) = P(Z \ge 1.66) \approx 0.05$$
We are estimating the proportion of women with depression who are 34 or older
in the population.
c) 0.08
d) 0.46
e) 0.95
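The standardization used for this question can be reproduced in a few lines. This is a minimal sketch assuming age is normally distributed with the sample mean and SD from Table 1 used as estimates of μ and σ; the variable names are illustrative.

```python
from statistics import NormalDist

mean_age = 24.4   # sample mean age, depression group (Table 1)
sd_age = 5.8      # sample SD of age, depression group (Table 1)
cutoff = 34.0     # age threshold in years

# Standardize the cutoff, then take the upper-tail area of the standard normal.
z = (cutoff - mean_age) / sd_age
p_individual = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P(X >= {cutoff:.0f}) = {p_individual:.3f}")
```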
4) What is the probability of obtaining a sample mean age of 34 years or older among women
with depression in the postpartum year?
a) 0
b) 0.05
c) 0.08
d) 0.46
e) 0.95
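This question concerns the sampling distribution of the mean rather than individual ages, so the spread to use is the standard error s/√n, not s. The sketch below assumes the sampling distribution of x̄ is approximately normal, centered at the sample mean of 24.4 years, with σ estimated by s = 5.8 and n = 375; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 375           # women with postpartum depression (Table 1)
mean_age = 24.4   # sample mean age, used as the estimate of mu
sd_age = 5.8      # sample SD of age, used as the estimate of sigma
cutoff = 34.0

# Spread of the sample mean is the standard error, not the SD.
se_mean = sd_age / math.sqrt(n)
z = (cutoff - mean_age) / se_mean
p_mean = 1 - NormalDist().cdf(z)

print(f"SE = {se_mean:.3f}, z = {z:.1f}, P(sample mean >= {cutoff:.0f}) = {p_mean:.4f}")
```

Under these assumptions the cutoff lies roughly 32 standard errors above the mean, so the probability is essentially 0, which would correspond to option a).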
5) From the data in Table 1, the sample standard deviation for CES-D depression score for
women with postpartum depression is approximately:
a) 0.0029 points
b) 0.06 points
c) 0.12 points
d) 1.1 points = SD = s, which is the variability in CES-D depression scores among the
375 women with postpartum depression
e) 1.2 points
6) From the data in Table 1, the standard error of the mean CES-D depression score for
women with postpartum depression is approximately:
a) 0.0029 points
b) 0.06 points = s/√n = 1.1/√375 = 0.0568 points, which is the variability in mean
CES-D depression scores in the sampling distribution.
c) 0.12 points
d) 1.1 points
e) 1.2 points
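The contrast between questions 5 and 6 is that s describes individual scores while s/√n describes sample means. A short sketch of the arithmetic follows, using the values s = 1.1 and n = 375 from the annotations above. The extra sample sizes in the loop are hypothetical and are included only to show how the standard error shrinks as n grows while s stays fixed.

```python
import math

sd_cesd = 1.1   # sample SD of CES-D score, depression group
n = 375         # group sample size

se_cesd = sd_cesd / math.sqrt(n)
print(f"SD = {sd_cesd}, SE of the mean = {se_cesd:.7f}")

# Hypothetical sample sizes: the SD is unchanged, but the SE shrinks with n.
for n_alt in (25, 100, 375):
    print(f"n = {n_alt:>3}: SE = {sd_cesd / math.sqrt(n_alt):.4f}")
```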
7) A previous study found an average CES-D depression score of 1.00 point in women with
postpartum depression. If we hypothesize that this study represents women with postpartum
depression sampled from a population with a mean CES-D score of 1.00 point, what is your
conclusion?
Below is the Stata output for the test of the null hypothesis that the population mean
CES-D depression score is 1.00 point in women with postpartum depression. Interpret.
We can calculate the observed test statistic:
$$t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1.25 - 1}{1.1/\sqrt{375}} = 4.4011$$
. ttesti 375 1.25 1.1 1

One-sample t test
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |     375        1.25    .0568038         1.1    1.138305    1.361695
------------------------------------------------------------------------------
    mean = mean(x)                                                t =   4.4011
Ho: mean = 1                                     degrees of freedom =      374

Annotation: Obs = n, Mean = x̄, Std. Err. = s/√n, Std. Dev. = s
Method 1:
With a two-sided α = 0.05, the critical region is t374 > 1.97 or t374 < -1.97. The observed
tobs = 4.4011 falls in the rejection region. Thus, we reject Ho.
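The critical-region decision in Method 1 can be sketched as follows. The critical value 1.97 is taken from the text rather than computed, and the variable names are illustrative.

```python
import math

n, xbar, s = 375, 1.25, 1.1   # summary statistics passed to ttesti
mu0 = 1.0                     # hypothesized population mean
t_crit = 1.97                 # two-sided 5% critical value for 374 df, as quoted in the text

# Observed t statistic: distance of the sample mean from mu0 in standard-error units.
t_obs = (xbar - mu0) / (s / math.sqrt(n))
reject = abs(t_obs) > t_crit

print(f"t_obs = {t_obs:.4f}, reject Ho: {reject}")
```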
Method 2:
We also can calculate the p-value = P( x̄ - 1 > 0.25 ) + P( x̄ - 1 < -0.25 )
= P(t > 4.4011) + P(t < -4.4011) ≈ 0.00002 (using the t
applet), which rounds to 0.0000 at four decimal places.
Thus, we reject Ho and conclude, based on our data, that the mean depression
score among women with postpartum depression is statistically significantly different
from 1 point.
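The two-sided p-value can also be approximated without an applet. The Python standard library has no t distribution, so the sketch below numerically integrates the t density with Simpson's rule. This is an illustrative approach, not how Stata or the applet computes the value, and the helper names `t_pdf` and `t_upper_tail` and the integration limit of 60 are assumptions.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, upper=60.0, steps=20000):
    """Approximate P(T > t) by Simpson's rule on [t, upper]; steps must be even."""
    h = (upper - t) / steps
    total = t_pdf(t, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3

t_obs, df = 4.4011, 374
# By symmetry of the t distribution, the two tails have equal area.
p_two_sided = 2 * t_upper_tail(t_obs, df)

print(f"two-sided p-value = {p_two_sided:.6f}")
```

The result is on the order of 10^-5, which is consistent with a p-value that displays as 0.0000 at four decimal places.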
Method 3:
We can also construct a 95% confidence interval for the true population mean
depression score (μ) based on:
$$\bar{x} \pm t_{374\,df,\,0.05/2} \cdot \frac{s}{\sqrt{n}} = 1.25 \pm 1.97 \times \frac{1.1}{\sqrt{375}} = (1.14,\ 1.36)$$
We can inspect this 95% CI to examine whether the interval contains the
hypothesized value of 1. It does not, so we reject Ho and conclude that the mean
CES-D depression score among women with postpartum depression is statistically
significantly different from 1 point.
Note: the inference you make will be the same by all 3 methods.
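The confidence-interval check in Method 3 can be sketched as below. The critical value is the rounded 1.97 quoted in Method 1, so the endpoints agree with the Stata interval (1.138305, 1.361695) only to about two decimal places; the variable names are illustrative.

```python
import math

n, xbar, s = 375, 1.25, 1.1
mu0 = 1.0
t_crit = 1.97   # rounded critical value for 374 df, as quoted in Method 1

se = s / math.sqrt(n)
ci_lower = xbar - t_crit * se
ci_upper = xbar + t_crit * se

# The test rejects Ho exactly when the hypothesized value lies outside the interval.
contains_null = ci_lower <= mu0 <= ci_upper

print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f}); contains {mu0}: {contains_null}")
```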
8) If we hypothesize that this study represents women with postpartum depression sampled
from a population with a mean CES-D score of 1.00 point, what is the p-value associated
with the corresponding two-sided test? Confirm that it agrees with the Stata output above.
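a) $P\left(Z > \frac{1.25 - 1.00}{1.1/\sqrt{375}}\right)$
The p-value is based on a t-statistic instead of a Z-statistic since σ is unknown.
b) $2 \cdot P\left(Z > \frac{1.25 - 1.00}{1.1}\right)$
c) $2 \cdot P\left(t_{374\,df} > \frac{1.25 - 1.0}{1.1/\sqrt{375}}\right) = P\left(t_{374\,df} > \frac{1.25 - 1.0}{1.1/\sqrt{375}}\right) + P\left(t_{374\,df} < \frac{-(1.25 - 1.0)}{1.1/\sqrt{375}}\right)$ Correct
d) $P\left(t_{374\,df} > \frac{1.25 - 1.0}{1.1/\sqrt{375}}\right)$ This is the p-value from a one-sided test.
e) $2 \cdot P\left(X > \frac{1.25 - 1.0}{1.1}\right)$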
9) Below is the Stata output for a test of the null hypothesis that population mean baseline
CES-D score does not differ by postpartum depression status. Setting a two-sided
significance level of 0.05, one would conclude that: (Circle only one response).
a) Mean CES-D score is statistically significantly greater in women with versus without
postpartum depression, 95% confidence interval (CI): 1.20 to 1.28 points
b) Mean CES-D score is statistically significantly greater in women without versus with
postpartum depression, 95% CI: 1.20 to 1.28 points
c) Mean CES-D score is statistically significantly greater in women with versus without
postpartum depression, 95% CI: 1.14 to 1.36 points
d) Mean CES-D score is statistically significantly greater in women without versus with
postpartum depression, 95% CI: 1.14 to 1.36 points
e) Mean CES-D score does not differ significantly by postpartum depression status.
The wording “test of the null hypothesis that population mean baseline CES-D score does not
differ by postpartum depression status” is the same as a “test of the null hypothesis that there is
no difference in the population mean baseline CES-D score between women with postpartum
depression and women without postpartum depression”.
Interpret:
We first test Ho: σ1=σ2 or (σ1/σ2) =1 in order to decide whether to pool the two sample
variances or not pool:
. sdtesti 375 1.25 1.1 2599 1.24 1.2
If the 2-sided p-value for this test is > 0.05, then we conclude that the variances are
not statistically significantly different from each other and we decide to pool the variances,
using a weighted average of the two sample variances in the calculation of the standard
error of the difference in sample means for the t-test of Ho: μ1=μ2.
If the 2-sided p-value for this test is < 0.05, then we conclude that the variances are
statistically significantly different from each other and we decide to keep the variance
estimates separate in the calculation of the standard error of the difference in sample
means for the t-test of Ho: μ1=μ2.
In this example, we reject the null hypothesis of equal variances and specify “, unequal” as
an option with the ttest command.
Method 1:
With a two-sided α = 0.05, the critical region is t511 > 1.96 or t511 < -1.96. The observed
tobs = 0.1626 does not fall in the rejection region. Thus, we fail to reject Ho and conclude
that there is no statistically significant difference in mean depression score by
postpartum depression status.
Method 2:
We also can calculate the p-value = P( x̄1 - x̄2 > 0.01 ) + P( x̄1 - x̄2 < -0.01 )
= P(t > 0.1626) + P(t < -0.1626) = 0.8709 (using the t
applet), and we fail to reject Ho.
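Method 3:
We can also construct a 95% confidence interval for the true difference in population mean
depression scores (μ1 - μ2) based on:
$$(\bar{X}_1 - \bar{X}_2) \pm t_{511\,df,\,0.05/2} \cdot s_{\bar{X}_1 - \bar{X}_2} = (1.25 - 1.24) \pm 1.96 \sqrt{\frac{1.1^2}{375} + \frac{1.2^2}{2599}} = (-0.11,\ 0.13)$$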
We can inspect this 95% CI to examine whether the interval contains the
hypothesized value of 0. It does, so we fail to reject Ho and conclude that the mean
CES-D depression scores among women with postpartum depression and those without
postpartum depression are not statistically significantly different.
Note: the inference you make will be the same by all 3 methods.
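The unequal-variance calculations behind the three methods can be sketched from the summary statistics in the sdtesti command. The degrees of freedom use Satterthwaite's approximation, which is assumed here to be the source of the 511 df in the text. The critical value 1.96 is taken from the text, and the variable names are illustrative.

```python
import math

n1, xbar1, s1 = 375, 1.25, 1.1     # women with postpartum depression
n2, xbar2, s2 = 2599, 1.24, 1.2    # women without postpartum depression
t_crit = 1.96                      # two-sided 5% critical value quoted in the text

# Unpooled (unequal-variance) standard error of the difference in means.
v1, v2 = s1**2 / n1, s2**2 / n2
se_diff = math.sqrt(v1 + v2)

diff = xbar1 - xbar2
t_obs = diff / se_diff

# Satterthwaite approximation to the degrees of freedom.
df_satt = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)

print(f"t_obs = {t_obs:.4f}, df = {df_satt:.1f}")
print(f"95% CI for mu1 - mu2: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because the interval spans 0 and |t_obs| is well below 1.96, the sketch gives the same fail-to-reject decision as the text.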
10) What is the difference between a 95% confidence interval (CI) and a 95% bootstrap interval?
The construction of a 95% confidence interval relies on the assumption that the sampling
distribution is normal or approximately normal. Thus, one constructs it in generic
fashion as:
estimate ± (critical value) × (standard error of the estimate)
In contrast, the 95% bootstrap interval does not rely on the assumption of a distributional form.
It is based on:
a) Taking the original sample and then resampling with replacement 1,000 (or more) times
with the same sample size.
b) Constructing a histogram of the relative frequency distribution of the resulting 1,000
sample means.
c) Identifying the middle 95% of values – this interval is the bootstrap interval.
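Steps a) and c) above can be sketched as follows. The raw study data are not available, so the sample here is synthetic (375 simulated ages) and purely hypothetical. The function name `bootstrap_interval` and its parameters are assumptions, and the histogram in step b) is omitted.

```python
import random
import statistics

def bootstrap_interval(sample, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Step a): resample with replacement, same size as the original sample.
    boot_means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Step c): keep the middle `level` share of the sorted resample means.
    cut = round((1 - level) / 2 * n_resamples)
    return boot_means[cut], boot_means[n_resamples - cut - 1], boot_means

# Synthetic, hypothetical sample used only to exercise the function.
data_rng = random.Random(2018)
sample = [data_rng.gauss(24.4, 5.8) for _ in range(375)]

lower, upper, boot_means = bootstrap_interval(sample)
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"95% bootstrap interval: ({lower:.2f}, {upper:.2f})")
```

With 1,000 resamples, the cut points drop the 25 smallest and 25 largest resample means, leaving the middle 950.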
Final Examination
Formula Page 2
Important Formulas for Statistical Inference
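Population
$$z = \frac{x - \mu}{\sigma}$$
One Sample
$$H_0: \mu = \mu_0 \qquad z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \qquad t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
$$H_0: p = p_0 \qquad z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$$
Two Samples
$$H_0: \mu_1 - \mu_2 = \mu_0 \qquad z = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \qquad t = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_0}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}} \quad \text{where } s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$$
$$H_0: \mu_d = d_0 \qquad t = \frac{\bar{d} - d_0}{s_d/\sqrt{n}}$$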
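As an illustration of the pooled two-sample formula, the sketch below applies it to the CES-D summary statistics used earlier. The worked example in this handout used the unequal-variance version, so the pooled result here is shown only to demonstrate the arithmetic of s_p², not as the recommended analysis. The variable names are illustrative.

```python
import math

n1, xbar1, s1 = 375, 1.25, 1.1     # women with postpartum depression
n2, xbar2, s2 = 2599, 1.24, 1.2    # women without postpartum depression
null_diff = 0.0                    # hypothesized difference in population means

# Pooled variance: average of the two sample variances, weighted by n - 1.
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

se_pooled = math.sqrt(sp2 / n1 + sp2 / n2)
t_pooled = ((xbar1 - xbar2) - null_diff) / se_pooled
df_pooled = n1 + n2 - 2

print(f"pooled variance = {sp2:.4f}")
print(f"t = {t_pooled:.4f} on {df_pooled} df")
```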