Unit X - Final Review - 4 Per Page
General Recap
• So let’s take a breath. Where do we stand?
• We’ve been looking at LOTS of different inferential
analysis procedures.
• What’s the main difference between them?
– The type of data we have!
– Is the outcome data quantitative (means) or is it a count/proportion from a yes/no setting (proportions)?
– Do we have 1 sample, 2 samples, or more than 2?
– Are we trying to relate two (or more) quantitative
variables?
• All these questions can help identify the situation, and thus
lead to the correct analysis technique
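The recap questions above can be read as a small decision rule. The following is only a sketch under stated assumptions: the function name `choose_procedure` and its argument names are hypothetical, and the mapping covers only the procedures named in this review (with ANOVA and chi-square assumed for the "more than 2 samples" cases).

```python
def choose_procedure(outcome, samples=1, sigma_known=False, paired=False):
    """Map the recap questions to a procedure name (hypothetical helper).

    outcome: "mean", "proportion", or "relationship"
             ("relationship" = relating two or more quantitative variables)
    samples: number of samples/groups (ignored for "relationship")
    """
    if outcome == "relationship":
        return "regression"
    if outcome == "mean":
        if samples == 1:
            return "one-sample z" if sigma_known else "one-sample t"
        if samples == 2:
            return "paired t" if paired else "two-sample t (pooled or unpooled)"
        return "ANOVA"
    if outcome == "proportion":
        if samples == 1:
            return "one-sample z for a proportion"
        if samples == 2:
            return "two-sample z for proportions"
        return "chi-square test"
    raise ValueError(f"unknown outcome type: {outcome!r}")


# Example: a quantitative outcome, one sample, sigma known
print(choose_procedure("mean", samples=1, sigma_known=True))
```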
8/2/2012
Intro to Inference
• Confidence Intervals
– General Formula
– Interpretation
• Hypothesis Testing
– Logic & Formulas
– p-values
• Application: deciding which procedure is appropriate
– This is the key!!!
Inference (One-Sample for Means, σ known)
• Confidence Intervals
– Formulas: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
– Interpretation
• Hypothesis Testing: $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
– Logic, Formula, p-value, Assumptions
• We assume Xi ~ N(μ,σ) & independent
• z ~ N(0,1) [when null hypothesis (and assumptions) are true]
– Example:
• A sample of 5 Harvard students was found to have an average IQ of 132 (it is known that the true standard deviation for IQ testing is 15). Is this significantly higher than 120?
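As a rough check of the IQ example, the one-sample z statistic and its one-sided p-value can be computed directly. This is a minimal sketch that assumes a one-sided alternative (Ha: μ > 120) and, for the interval, a 95% confidence level; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Values from the IQ example
x_bar, mu_0, sigma, n = 132, 120, 15, 5

se = sigma / sqrt(n)               # standard error of the mean
z = (x_bar - mu_0) / se            # one-sample z statistic
p_value = 1 - NormalDist().cdf(z)  # upper-tail p-value (Ha: mu > 120)

# 95% confidence interval for mu (confidence level is an assumption)
z_star = NormalDist().inv_cdf(0.975)
ci = (x_bar - z_star * se, x_bar + z_star * se)

print(f"z = {z:.3f}, one-sided p = {p_value:.4f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```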
Inference (One-Sample for Means, σ known)
One sample z-test for μ: $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
CI for one sample μ: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
• We assume Xi ~ N(μ,σ) & independent
• z ~ N(0,1) [when null hypothesis (and assumptions) are true]
• Example:
– A sample of 5 Harvard students was found to have an average IQ of 132 (it is known that the true standard deviation for IQ testing is 15). Is this significantly higher than 120?
• Power
– Determine the rejection region under H0
– Calculate the probability of falling in that rejection region when Ha is true
– Increases with: i) larger sample size, ii) smaller σ, iii) further distance between μA and μ0
Inference (One-Sample for Means)
One sample t-test for μ: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
CI for one sample μ: $\bar{x} \pm t^*_{df = n-1} \frac{s}{\sqrt{n}}$
• We assume Xi ~ N(μ,σ) & independent
• t ~ t(df = n–1) [if null hypothesis is true]
• Example:
– A sample of 25 Harvard students were found to sleep on average 7.4 hours a night, with a standard deviation of 1.3 hours. Is this significantly lower than the recommended amount of 8 hours a night?
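The two power steps and the sleep t-test can be sketched numerically. The assumptions here are mine, not the slides': α = 0.05, a one-sided test, and μA = 132 as the alternative value for the power calculation. Since the standard library has no t distribution, `t_cdf` is a hypothetical helper that integrates the t density with Simpson's rule.

```python
from math import sqrt, gamma, pi
from statistics import NormalDist

# --- Power of the one-sided z-test (IQ example) ---
mu_0, sigma, n, alpha = 120, 15, 5, 0.05   # alpha is an assumption
mu_a = 132                                 # assumed alternative mean
se = sigma / sqrt(n)
cutoff = mu_0 + NormalDist().inv_cdf(1 - alpha) * se   # reject H0 if x_bar > cutoff
power = 1 - NormalDist(mu_a, se).cdf(cutoff)           # P(reject | mu = mu_a)

# --- One-sample t-test (sleep example) ---
def t_pdf(x, df):
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """Approximate t CDF: Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    b = abs(x)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

x_bar, s, n_sleep, mu_sleep = 7.4, 1.3, 25, 8
t_stat = (x_bar - mu_sleep) / (s / sqrt(n_sleep))
p_lower = t_cdf(t_stat, n_sleep - 1)       # lower-tail p-value (Ha: mu < 8)

print(f"power = {power:.3f}")
print(f"t = {t_stat:.3f}, one-sided p = {p_lower:.4f}")
```

Raising `n`, lowering `sigma`, or moving `mu_a` further from `mu_0` in this sketch increases `power`, matching the three conditions listed on the slide.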
Inference (Two-Sample for Means)
• Should we pool or not? Yes, if Rule holds: $\frac{s_{larger}}{s_{smaller}} < 1.5$
– Advantages vs. Disadvantages
• Pooled Procedure: $s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$
2-sample pooled t-test: $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_{1,0} - \mu_{2,0})}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
2-sample pooled t-based CI: $(\bar{x}_1 - \bar{x}_2) \pm t^*_{df = n_1 + n_2 - 2} \, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
• Unpooled Procedure
2-sample unpooled t-test: $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_{1,0} - \mu_{2,0})}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
2-sample unpooled t-based CI: $(\bar{x}_1 - \bar{x}_2) \pm t^*_{df = \min(n_1, n_2) - 1} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Inference (Two-Sample for Means)
• Don’t forget the paired t-test!!!
• Examples:
– The average # calories consumed in one day for a random sample of 25 Harvard freshmen was 2,709 with s = 400. The average # calories consumed in one day for a random sample of 16 Harvard sophomores was 2,572 with s = 380. Is there evidence of a difference in the number of calories consumed by freshmen and sophomores?
– Pairs of individuals (one male, one female) were selected from 10 randomly selected Harvard houses. The following caloric intake statistics were calculated:
          average    sd
  Men       2952    522
  Women     2245    407
  Diff       707    388
• Calculate 99% CI for the mean difference between the sexes in the number of calories that Harvard students choose to consume.
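Both calorie examples can be worked through numerically. This sketch assumes the pooling rule is "pool when the ratio of the larger to the smaller standard deviation is below 1.5", and it takes the critical value t* = 3.250 (df = 9, 99% confidence) from a standard t table rather than computing it.

```python
from math import sqrt

# --- Freshmen vs. sophomores: pooled two-sample t-test ---
n1, xbar1, s1 = 25, 2709, 400
n2, xbar2, s2 = 16, 2572, 380

ratio = max(s1, s2) / min(s1, s2)
pool = ratio < 1.5                      # assumed form of the pooling rule

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance
sp = sqrt(sp2)
t_pooled = (xbar1 - xbar2) / (sp * sqrt(1 / n1 + 1 / n2))     # H0: mu1 - mu2 = 0
df_pooled = n1 + n2 - 2

# --- Men vs. women: paired t-based 99% CI on the differences ---
n_pairs, d_bar, s_d = 10, 707, 388
t_star = 3.250                          # table value for df = 9, 99% confidence
margin = t_star * s_d / sqrt(n_pairs)
ci_paired = (d_bar - margin, d_bar + margin)

print(f"ratio = {ratio:.3f}, pool = {pool}")
print(f"pooled t = {t_pooled:.3f} on {df_pooled} df")
print(f"99% CI for mean difference: ({ci_paired[0]:.1f}, {ci_paired[1]:.1f})")
```

Note that the paired interval uses only the "Diff" row (mean 707, sd 388), not the separate standard deviations for men and women.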
Inference (One-Sample for Proportions)
One sample z-test: $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}}$
CI of one proportion: $\hat{p} \pm z^* \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$
Hold only if np ≥ 10 and n(1-p) ≥ 10. Why?
Example:
– In a random sample of 100 Harvard men, 25 were over 6 feet tall.
• Estimate the 95% confidence interval for the proportion of male Harvard students that are taller than 6 feet.
• Is there evidence that this proportion is different than the US as a whole (where the proportion overall is 0.17)?
Inference (Two-Sample for Proportions)
2 sample z-test: $z = \frac{(\hat{p}_1 - \hat{p}_2) - (0)}{\sqrt{\hat{p}_p (1 - \hat{p}_p) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$
2-sample z-based Conf. Int.: $(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}$
• Where: $\hat{p}_p = \frac{X_1 + X_2}{n_1 + n_2}$
• Again, we are assuming the np ≥ 10, and n(1-p) ≥ 10 bounds
• Example:
– 19 out of 29 randomly sampled undergraduate women said they prefer writing papers to taking exams, while only 20 out of 47 men preferred papers over exams. Is this evidence to support that undergraduate women as a whole tend to prefer papers over exams more than men?
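The proportion examples above can be sketched the same way. The choices of a two-sided test for the height question and a one-sided test for the papers-vs-exams question follow the wording of the examples; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

# --- One sample: 25 of 100 Harvard men over 6 feet ---
n, x, p0 = 100, 25, 0.17
p_hat = x / n
z_star = std_normal.inv_cdf(0.975)                 # 95% confidence
se_ci = sqrt(p_hat * (1 - p_hat) / n)              # CI uses p_hat
ci = (p_hat - z_star * se_ci, p_hat + z_star * se_ci)

se_test = sqrt(p0 * (1 - p0) / n)                  # test uses p0
z_one = (p_hat - p0) / se_test
p_two_sided = 2 * (1 - std_normal.cdf(abs(z_one)))

# --- Two samples: 19 of 29 women vs. 20 of 47 men prefer papers ---
x1, n1, x2, n2 = 19, 29, 20, 47
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                     # pooled proportion under H0
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z_two = (p1 - p2) / se_pool
p_one_sided = 1 - std_normal.cdf(z_two)            # Ha: p_women > p_men

print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"one-sample z = {z_one:.3f}, two-sided p = {p_two_sided:.4f}")
print(f"two-sample z = {z_two:.3f}, one-sided p = {p_one_sided:.4f}")
```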
(H0: Rows and Columns are Independent)
$\chi^2 = \sum_{\text{all cells}} \frac{(\text{Obs} - \text{Exp})^2}{\text{Exp}}$, where $\text{Exp} = \frac{(\text{row total})(\text{col total})}{\text{grand total}}$
Simple Example: Are Smoking and Lung Cancer Associated?
                     Smoke
                Heavy  Light  Never  Total
  Lung     Yes     33     20      7     60
  Cancer   No      27     40     53    120
– t-test of β coefficients
– F-test of entire model
– R² = r²
– Prediction and Confidence intervals at a particular x*
• Multiple Regression Only Topics
– Adjusted R²
– Step-Up Model Building
• Assumptions (4): εi ~ N(0,σ) & independent
    price |    Coef.  Std.Er.      t   P>|t|    [95% Conf.Int.]
----------+----------------------------------------------------
     sqft |   .23429   .09352   2.51   0.018    .04355   .42504
    _cons |   592.24   385.39   1.54   0.135   -193.77   1378.2
[Figure: scatterplot with SqFt on the horizontal axis (2000 to 8000) and a vertical axis from 0 to 4000]
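For the smoking and lung cancer table, the chi-square statistic can be computed from the observed counts alone. This is a minimal sketch; the p-value line relies on the fact that for df = 2 the chi-square upper-tail probability is exactly exp(-χ²/2), so it does not generalize to other table sizes.

```python
from math import exp

# Observed counts: rows = lung cancer (Yes, No), columns = smoke (Heavy, Light, Never)
observed = [
    [33, 20, 7],
    [27, 40, 53],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: (row total)(col total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail p-value; this closed form is valid only because df == 2 here
p_value = exp(-chi2 / 2)

print(f"chi-square = {chi2:.2f} on {df} df, p = {p_value:.2e}")
```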
. corr price sqft lot distance
(obs=33)
             |    price     sqft      lot distance
-------------+------------------------------------
       price |   1.0000
        sqft |   0.4103   1.0000
         lot |   0.7760   0.3484   1.0000
    distance |   0.4913   0.1856   0.3807   1.0000
[Figure: residuals plotted against fitted values (linear prediction)]
. regress price lot distance sqft
  Source |       SS   df       MS        Num of obs =     33
---------+------------------------       F( 3, 29)  =  19.25
   Model | 11774453    3  3924817        Prob > F   = 0.0000
Residual |  5912577   29   203882        R-squared  = 0.6657
---------+------------------------       Adj R-sq   = 0.6311
   Total | 17687031   32   552720        Root MSE   = 451.53
---------------------------------------------------------
   price |   Coef.  Std.Er.     t  P>|t|   [95% Conf.Int]
---------+-----------------------------------------------
     lot |  .10559  .02008   5.26  0.000    .0645   .1467
distance |  470.97  249.00   1.89  0.069   -38.30   980.2
    sqft |  .08341  .06552   1.27  0.213   -.0506   .2174
   _cons |  132.18  283.20   0.47  0.644   -447.0   711.4
---------------------------------------------------------
[Figure: histogram (density) of the residuals, ranging from about -1000 to 1000]
From Regression
• All ANOVA Tables have the same basic form
• Remember, Total Sums of Squares in y can be decomposed as:
SST = SSM + SSE
It's all based on the prediction of the observations and the error:
$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
ANOVA table in regression
  Source             SS    DF                MS              F
  Model              SSM   DFM = p           MSM = SSM/DFM   MSM/MSE
  Error (Residual)   SSE   DFE = n - p - 1   MSE = SSE/DFE
  Total              SST   DFT = n - 1       MST = SST/DFT
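The ANOVA identities can be checked against the `regress price lot distance sqft` output. The sketch below only re-derives the summary statistics from the sums of squares printed in that output (n = 33, p = 3); it does not refit the model.

```python
from math import sqrt

# Sums of squares and sizes taken from the Stata output
ssm, sse, sst = 11774453, 5912577, 17687031
n, p = 33, 3

dfm, dfe, dft = p, n - p - 1, n - 1
msm, mse, mst = ssm / dfm, sse / dfe, sst / dft

f_stat = msm / mse               # F-test of the entire model
r_squared = ssm / sst            # share of total variation explained
adj_r_squared = 1 - mse / mst    # adjusted R-squared
root_mse = sqrt(mse)

# SST = SSM + SSE holds up to rounding in the printed output
decomposition_gap = sst - (ssm + sse)

print(f"F({dfm}, {dfe}) = {f_stat:.2f}")
print(f"R-squared = {r_squared:.4f}, Adj R-sq = {adj_r_squared:.4f}")
print(f"Root MSE = {root_mse:.2f}, SST - (SSM + SSE) = {decomposition_gap}")
```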
------------------------------------------------------------
   price |   Coef.  Std.Er.      t  P>|t|   [95% Conf.Int]
---------+--------------------------------------------------
   white |  656.72  269.15   2.44  0.021   107.04   1206.4
   cream | -295.85  302.57  -0.98  0.336   -913.8   322.08
[Figure: plot with price on the vertical axis]
b) You are interested in estimating the average amount of household consumer debt among a segment of the US population, and you have a random sample from that segment giving consumer debt in dollars per
e) In the United States, there are approximately 900,000 HIV positive individuals among the total population of approximately 293,000,000, so that the population rate of HIV disease is known to be 0.3%. You suspect the rate is higher among some ethnic groups, especially Hispanic males, and you have data on the HIV status from a random sample of 200 Hispanic males.
g) You suspect that a small number of respondents to a survey have overstated their income. You are aware that income data tends to be right skewed, so you transform the data by taking the logarithm of each value. You would like to use a graphical display to look for the presence of outliers in the transformed data.
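For problem e), one thing worth checking before reaching for the one-sample z-test for a proportion is the np ≥ 10 condition. The sketch below runs that check and, as one possible alternative, computes an exact binomial upper-tail probability. The observed count of 3 positives is purely hypothetical (the problem gives no data), and `binom_upper_tail` is an illustrative helper, not something from the course materials.

```python
from math import comb

p0 = 900_000 / 293_000_000     # known population rate (about 0.3%)
n = 200                        # sample size from the problem

expected_positives = n * p0
normal_approx_ok = expected_positives >= 10 and n * (1 - p0) >= 10

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

observed_positives = 3         # HYPOTHETICAL count, for illustration only
p_value = binom_upper_tail(observed_positives, n, p0)

print(f"n * p0 = {expected_positives:.3f}, normal approximation OK: {normal_approx_ok}")
print(f"exact P(X >= {observed_positives}) = {p_value:.4f}")
```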
b) From this study, is there evidence to suggest there is a difference in frequency of use of curse words between the sexes in the general US population?
(a) To the right is the histogram of the response variable, runs_diff. Comment on the plot.
[Figure: histogram of runs_diff (density from 0 to .15) over the range -10 to 10]
4. Below are the summary statistics of the daily high temperature (˚F) in Boston, MA from May 1st until August 2nd (n = 94). These high temperatures were split into days when it rains (precip = 1) vs. days it doesn’t rain (precip = 0).
. tabulate precip, summarize(maxtemp)
            |        Summary of maxtemp
     precip |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          0 |   79.196078   11.568958          51
          1 |    71.44186   12.177853          43
------------+------------------------------------
      Total |   75.648936   12.410287          94
(a) Perform an appropriate hypothesis test to determine whether there is a difference in high temperature on rainy vs. non-rainy days.
(b) The typical daily high temperature for these dates in Boston, MA is 75˚F. Perform an appropriate hypothesis test to determine whether or not we have been lucky and had a hotter average high temperature than normal.
5. An animal-behavior specialist studied tennis legend John McEnroe’s behavior as he won the 1983 Wimbledon tennis tournament. Among the data gathered were whether McEnroe grunted on his serve, and whether his serve was an ace (i.e., a specific way to win the point on the serve), a fault (i.e., error), or other.
(a) Suppose that when McEnroe serves, he remains silent with probability 0.30. Also suppose that whether McEnroe is silent is independent from serve to serve. Find the probability that out of 5 serves McEnroe remains silent at least 3 times.
(b) With the same assumptions as in part (a), calculate the approximate probability that McEnroe is silent at least half the time out of 40 serves.
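The McEnroe parts (a) and (b) of problem 5 are a binomial calculation and its normal approximation. This is a sketch under one assumption of mine: part (b) uses a continuity correction (cutting at 19.5), which the problem does not specify. The exact binomial value is computed alongside for comparison.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_silent = 0.30

# (a) Exact: P(X >= 3) out of 5 serves
prob_a = sum(binom_pmf(k, 5, p_silent) for k in range(3, 6))

# (b) Normal approximation: P(X >= 20) out of 40 serves
n = 40
mean = n * p_silent
sd = sqrt(n * p_silent * (1 - p_silent))
prob_b_approx = 1 - NormalDist(mean, sd).cdf(19.5)   # continuity correction (assumption)
prob_b_exact = sum(binom_pmf(k, n, p_silent) for k in range(20, n + 1))

print(f"(a) P(X >= 3 of 5)   = {prob_a:.5f}")
print(f"(b) approx P(X >= 20) = {prob_b_approx:.5f}, exact = {prob_b_exact:.5f}")
```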