
8/2/2012

Unit X: Final Exam Review

General Recap
• So let’s take a breath. Where do we stand?
• We’ve been looking at LOTS of different inferential analysis procedures.
• What’s the main difference between them?
  – The type of data we have!
  – Is the outcome data quantitative (means) or is it a count/proportion from a yes/no setting (proportions)?
  – Do we have 1 sample, 2 samples, or more than 2?
  – Are we trying to relate two (or more) quantitative variables?
• All these questions can help identify the situation, and thus lead to the correct analysis technique.


Roadmap to Inference in Stat S100

Start Here! What type of outcome data do we have?

• Quantitative Data
  – 1 group
    • σ known: Inference for μ (z-based)
    • σ unknown (use s): Inference for μ (t-based)
  – 2 groups
    • Independent groups, σ1 = σ2: Pooled t-procedure
    • Independent groups, σ1 ≠ σ2: Non-pooled t-procedure
    • Paired groups: Paired t-procedure
  – 2 or more groups: Regression with Binary Predictors
  – Quantitative predictor(s)
    • One Predictor: Simple linear regression
    • 2+ Predictors: Multiple regression
• Binary Data (yes/no)
  – 1 group: Inference for p
  – 2 groups: Inference for p1 – p2
  – 2 or more groups: χ2 test for association
  – Quantitative predictor(s): Logistic Regression*
• Categorical Data (2+ categories)
  – 2 or more groups: χ2 test for association

*NOTE: Logistic Regression is NOT on the Final Exam

Exam Details
• Exam in Science Center, Lecture Hall B. It begins at 6:30pm and runs for 3 full hours.
• It’s open-book & open-notes. No laptops or cell phones. Remember to bring a calculator.
• It’s cumulative, but with ~3/4 of questions coming from units 6-10 (Chapters 6-11 in text).
• Be sure to briefly explain your answers and show calculations.
• We encourage you to make a ‘cheat sheet’ with important formulas & ideas. It saves time otherwise spent flipping through the text.
  – Note: many of the inference formulas are in these review notes, but the notes do not contain everything covered since the midterm.
• Extra office hours before the exam (SC-600, 601, or 602):
  – Fri: 2-4pm Online: 2-4pm
  – Sat: 1-2, 4-5pm Online: 6-7pm
  – Sun: 2-4pm Online: 9-10pm
  – Mon: 2-6pm Online: 5-6pm
  – Tues: 2-6pm Online: 10-11am

Old Topics from the Midterm
• Summarizing Data
  – Univariate
  – Bivariate
• Normal Distribution
• Study Design
  – Experimental Design and Randomization
  – Surveys and Random Sampling
• Introduction to Probability (& Randomness)
  – Conditional Probability, Independence, & Bayes’ Theorem
• Introduction to Random Variables
  – Binomial Random Variables
  – Normal Random Variables
• Normal Approximation to the Binomial
• Central Limit Theorem and Law of Large Numbers
• Intro to Inference (theory behind hypothesis tests and conf int’s)

New Topics Since the Midterm: Inference
• One Sample Inference for a Mean
• Two Sample Inference for Means
• One Sample Inference for a Proportion
• Two Sample Inference for Proportions
• Chi-Squared Test for a Two-Way Table
• Simple Regression
• Multiple Regression
• Regression with Binary Predictors

Intro to Inference
• Confidence Intervals
  – General Formula
  – Interpretation
• Hypothesis Testing
  – Logic & Formulas
  – p-values
• Application: deciding which procedure is appropriate
  – This is the key!!!

Inference (One-Sample for Means, σ known)
• Confidence Intervals
  – Formulas: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
  – Interpretation
• Hypothesis Testing: $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
  – Logic, Formula, p-value, Assumptions
    • We assume Xi ~ N(μ,σ) & independent
    • z ~ N(0,1) [when null hypothesis (and assumptions) are true]
  – Example:
    • A sample of 5 Harvard students was found to have an average IQ of 132 (it is known the true standard deviation for IQ testing is 15). Is this significantly higher than 120?
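As a rough check on the IQ example above, the z statistic and its upper-tail p-value can be computed directly from the slide's formula. This is only a sketch: the function name `one_sample_z` is made up for illustration, and the one-sided alternative (Ha: μ > 120) is an assumption read off the word "higher" in the question.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(xbar, mu0, sigma, n):
    """Return (z, upper-tail p-value) for H0: mu = mu0 vs Ha: mu > mu0.

    Hypothetical helper: assumes sigma is known and the Xi are
    independent N(mu, sigma), as stated on the slide.
    """
    se = sigma / sqrt(n)                 # standard error of the sample mean
    z = (xbar - mu0) / se                # z = (xbar - mu0) / (sigma / sqrt(n))
    p_upper = 1 - NormalDist().cdf(z)    # area to the right of z under N(0,1)
    return z, p_upper


# IQ example: n = 5, xbar = 132, sigma = 15, mu0 = 120
z, p = one_sample_z(xbar=132, mu0=120, sigma=15, n=5)
print(f"z = {z:.3f}, one-sided p-value = {p:.4f}")
```

With these inputs z is about 1.79, so the one-sided p-value falls a little below 0.05.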

Inference (One-Sample for Means, σ known)
One sample z-test for μ: $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
CI for one sample μ: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
• We assume Xi ~ N(μ,σ) & independent
• z ~ N(0,1) [when null hypothesis (and assumptions) are true]
• Example:
  – A sample of 5 Harvard students was found to have an average IQ of 132 (it is known the true standard deviation for IQ testing is 15). Is this significantly higher than 120?
• Power
  – Determine the rejection region under Ho
  – Calculate the probability of falling in that rejection region when Ha is true
  – Increases with: i) larger sample size, ii) smaller σ, iii) further distance between μA and μ0

Inference (One-Sample for Means)
One sample t-test for μ: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
CI for one sample μ: $\bar{x} \pm t^*_{df = n-1} \frac{s}{\sqrt{n}}$
• We assume Xi ~ N(μ,σ) & independent
• t ~ t(df = n–1) [if null hypothesis is true]
• Example:
  – A sample of 25 Harvard students was found to sleep on average 7.4 hours a night, with a standard deviation of 1.3 hours. Is this significantly lower than the recommended amount of 8 hours a night?
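The sleep example and the power recipe above can both be sketched numerically. Several things here are assumptions and not part of the slides: the helper names, the alternative mean μA = 130 used for the power calculation, and the 5% significance level. The critical value 1.711 is the usual one-sided 5% t-table entry for df = 24.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_t(xbar, mu0, s, n):
    """t statistic and degrees of freedom for a one-sample t-test."""
    return (xbar - mu0) / (s / sqrt(n)), n - 1


def z_test_power(mu0, mu_a, sigma, n, alpha=0.05):
    """Power of the upper-tailed z-test of H0: mu = mu0 when mu = mu_a."""
    se = sigma / sqrt(n)
    # Step 1: rejection region under H0 is xbar > cutoff
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Step 2: probability of landing in that region when Ha is true
    return 1 - NormalDist(mu_a, se).cdf(cutoff)


# Sleep example: n = 25, xbar = 7.4, s = 1.3, mu0 = 8 (Ha: mu < 8)
t, df = one_sample_t(xbar=7.4, mu0=8, s=1.3, n=25)
T_CRIT_DF24 = 1.711            # t-table value, one-sided alpha = 0.05, df = 24
reject = t < -T_CRIT_DF24      # lower-tailed test
print(f"t = {t:.3f} on {df} df, reject H0 at 5%: {reject}")

# Power for the IQ z-test, with a HYPOTHETICAL true mean of 130
power_n5 = z_test_power(mu0=120, mu_a=130, sigma=15, n=5)
power_n20 = z_test_power(mu0=120, mu_a=130, sigma=15, n=20)
print(f"power with n=5: {power_n5:.3f}, with n=20: {power_n20:.3f}")
```

The second print line illustrates point i) of the power slide: with everything else fixed, the larger sample gives higher power.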

Inference (Two-Sample for Means)
• Should we pool or not? Yes, if Rule holds: $\frac{s_{larger}}{s_{smaller}} < 1.5$
  – Advantages vs. Disadvantages
• Pooled Procedure ( $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ )
  – 2-sample pooled t-test: $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_{1,0} - \mu_{2,0})}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
  – 2-sample pooled t-based CI: $(\bar{x}_1 - \bar{x}_2) \pm t^*_{df = n_1 + n_2 - 2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
• Unpooled Procedure
  – 2-sample unpooled t-test: $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_{1,0} - \mu_{2,0})}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
  – 2-sample unpooled t-based CI: $(\bar{x}_1 - \bar{x}_2) \pm t^*_{df = \min(n_1, n_2) - 1}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Inference (Two-Sample for Means)
• Don’t forget the paired t-test!!!
• Examples:
  – The average # calories consumed in one day for a random sample of 25 Harvard freshmen was 2,709 with s = 400. The average # calories consumed in one day for a random sample of 16 Harvard sophomores was 2,572 with s = 380. Is there evidence of a difference in the number of calories consumed by freshmen and sophomores?
  – Pairs of individuals (one male, one female) were selected from 10 randomly selected Harvard houses. The following caloric intake statistics were calculated:

               average    sd
      Men        2952    522
      Women      2245    407
      Diff        707    388

    • Calculate a 99% CI for the mean difference between the sexes in the number of calories that Harvard students choose to consume.
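The two calorie examples above can be worked through with the pooled and paired formulas. Treat this as a sketch: the helper name `pooled_t` is made up for illustration, and the critical value 3.250 is the standard t-table entry for 99% confidence with df = 9.

```python
from math import sqrt


def pooled_t(x1, s1, n1, x2, s2, n2):
    """Pooled two-sample t statistic for H0: mu1 - mu2 = 0.

    Returns (t, df, s_p). Hypothetical helper following the slide's formulas.
    """
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    se = sp * sqrt(1 / n1 + 1 / n2)
    return (x1 - x2) / se, n1 + n2 - 2, sp


# Freshmen vs. sophomores: check the pooling rule first
ratio = 400 / 380                      # s_larger / s_smaller, about 1.05
t, df, sp = pooled_t(2709, 400, 25, 2572, 380, 16)
print(f"ratio = {ratio:.3f}, s_p = {sp:.2f}, t = {t:.3f} on {df} df")

# Paired example: work only with the differences (mean 707, sd 388, 10 pairs)
T_STAR_DF9_99 = 3.250                  # t-table value, 99% CI, df = 10 - 1
margin = T_STAR_DF9_99 * 388 / sqrt(10)
ci = (707 - margin, 707 + margin)
print(f"99% CI for mean difference: ({ci[0]:.1f}, {ci[1]:.1f})")
```

Note that the paired interval uses only the Diff row; the separate Men and Women standard deviations do not enter the calculation.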

Inference (One-Sample for Proportions)
One sample z-test: $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$
CI of one proportion: $\hat{p} \pm z^*\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
• These hold only if np ≥ 10 and n(1-p) ≥ 10. Why?
• Example:
  – In a random sample of 100 Harvard men, 25 were over 6 feet tall.
    • Estimate the 95% confidence interval for the proportion of male Harvard students that are taller than 6 feet.
    • Is there evidence that this proportion is different than the US as a whole (where the proportion overall is 0.17)?

Inference (Two-Sample for Proportions)
2 sample z-test: $z = \frac{(\hat{p}_1 - \hat{p}_2) - (0)}{\sqrt{\hat{p}_p(1 - \hat{p}_p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
2-sample z-based Conf. Int.: $(\hat{p}_1 - \hat{p}_2) \pm z^*\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
• Where: $\hat{p}_p = \frac{X_1 + X_2}{n_1 + n_2}$
• Again, we are assuming the np ≥ 10 and n(1-p) ≥ 10 bounds
• Example:
  – 19 out of 29 randomly sampled undergraduate women said they prefer writing papers to taking exams, while only 20 out of 47 men preferred papers over exams. Is this evidence to support that undergraduate women as a whole tend to prefer papers over exams more than men?
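A small sketch of the proportion formulas applied to the two examples above. The function names are illustrative only, and the choice of a two-sided test for the height example and a one-sided test for the papers-vs-exams example is an assumption read off the wording of each question.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_prop_ci(x, n, level=0.95):
    """z-based confidence interval for a single proportion."""
    p_hat = x / n
    z_star = STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def one_prop_z(x, n, p0):
    """z statistic for H0: p = p0 (standard error uses p0, not p-hat)."""
    return (x / n - p0) / sqrt(p0 * (1 - p0) / n)


def two_prop_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se


# Height example: 25 of 100 men over 6 feet, comparison value p0 = 0.17
ci_low, ci_high = one_prop_ci(25, 100)
z_height = one_prop_z(25, 100, 0.17)
p_height = 2 * (1 - STD_NORMAL.cdf(abs(z_height)))   # two-sided

# Papers vs. exams: 19 of 29 women, 20 of 47 men (Ha: p_women > p_men)
z_papers = two_prop_z(19, 29, 20, 47)
p_papers = 1 - STD_NORMAL.cdf(z_papers)              # one-sided

print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
print(f"height: z = {z_height:.3f}, p = {p_height:.4f}")
print(f"papers: z = {z_papers:.3f}, p = {p_papers:.4f}")
```

The test statistics use the null (or pooled) proportion in the standard error, while the confidence interval uses the sample proportion, matching the two columns of formulas on the slides.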

Chi-Squared Test

Two-Way Tables
H0: P(X = j and Y = k) = P(X = j)P(Y = k).
HA: P(X = j and Y = k) ≠ P(X = j)P(Y = k).
(H0: Rows and Columns are Independent)

$\chi^2_{df = (J-1)(K-1)} = \sum_{\text{all cells}} \frac{(Obs - Exp)^2}{Exp}$, where $Exp = \frac{(\text{row total}) \times (\text{col total})}{\text{overall total}}$

Simple Example: Are Smoking and Lung Cancer Associated?

                         Smoke
                  Heavy  Light  Never  Total
  Lung     Yes      33     20      7     60
  Cancer   No       27     40     53    120
           Total    60     60     60    180

Linear Regression
• Model Statements
  $\mu_y = \beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip}$
  $y_i = \beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip} + \varepsilon_i$
• Main Inference Concepts
  – t-test of β coefficients
  – F-test of entire model (H0: β1 = β2 = … = 0)
• Simple Regression Only Topics
  – R2 = r2
  – Prediction and Confidence intervals at a particular x*
• Multiple Regression Only Topics
  – Adjusted R2
  – Step-Up Model Building
• Assumptions (4): εi ~ N(0,σ) & independent
  (transforming your variables could help fix 3 of these assumptions)

. regress price sqft

    Source |       SS     df        MS       Number of obs =      33
----------+---------------------------       F(  1,    31) =    6.28
     Model |   2977829     1    2977829      Prob > F      =  0.0177
  Residual |  14709203    31   474490.4      R-squared     =  0.1684
----------+---------------------------       Adj R-squared =  0.1415
     Total |  17687031    32   552719.7      Root MSE      =  688.83

---------------------------------------------------------------
    price |   Coef.   Std.Er.     t    P>|t|   [95% Conf.Int.]
----------+----------------------------------------------------
     sqft |  .23429   .09352    2.51   0.018   .04355   .42504
    _cons |  592.24   385.39    1.54   0.135  -193.77   1378.2
---------------------------------------------------------------

[Figure: scatterplot of price against SqFt with fitted values]
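Returning to the smoking and lung cancer table on the chi-squared slide above, the expected counts and test statistic can be computed in a few lines. This is a sketch only: `chi_square` is a hypothetical helper, and the p-value line relies on the fact that for exactly 2 degrees of freedom the chi-squared upper-tail area is exp(-x/2); it would not be valid for other df.

```python
from math import exp

# Rows: lung cancer Yes / No; columns: Heavy, Light, Never smokers
observed = [
    [33, 20, 7],
    [27, 40, 53],
]


def chi_square(table):
    """Chi-squared statistic and df for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # (row)(col)/overall
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)               # (J-1)(K-1)
    return stat, df


stat, df = chi_square(observed)
p_value = exp(-stat / 2)   # upper-tail area; this shortcut holds ONLY when df == 2
print(f"chi-squared = {stat:.2f} on {df} df, p-value = {p_value:.2e}")
```

Every expected count here is 20 (Yes row) or 40 (No row), so the usual requirement of reasonably large expected counts is comfortably met.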


Multiple Linear Regression

. corr price sqft lot distance
(obs=33)

             |    price     sqft      lot distance
-------------+------------------------------------
       price |   1.0000
        sqft |   0.4103   1.0000
         lot |   0.7760   0.3484   1.0000
    distance |   0.4913   0.1856   0.3807   1.0000

. regress price lot distance sqft

  Source |       SS     df       MS       Num of obs  =     33
---------+------------------------       F(  3,  29) =  19.25
   Model | 11774453      3  3924817       Prob > F    = 0.0000
Residual |  5912577     29   203882       R-squared   = 0.6657
---------+------------------------       Adj R-sq    = 0.6311
   Total | 17687031     32   552720       Root MSE    = 451.53

---------------------------------------------------------
   price |   Coef.   Std.Er.     t    P>|t|  [95% Conf.Int]
---------+-----------------------------------------------
     lot |  .10559   .02008    5.26   0.000   .0645   .1467
distance |  470.97   249.00    1.89   0.069  -38.30   980.2
    sqft |  .08341   .06552    1.27   0.213  -.0506   .2174
   _cons |  132.18   283.20    0.47   0.644  -447.0   711.4
---------------------------------------------------------

[Figure: residuals plotted against fitted values (linear prediction)]
[Figure: histogram (density) of the residuals]

Analysis of Variance Tables From Regression
• All ANOVA Tables have the same basic form
• Remember, Total Sums of Squares in y can be decomposed as:
  SST = SSM + SSE
  It’s all based on the prediction of the observations and the error:
  $\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

ANOVA table in regression

  Source             SS    DF               MS               F
  Model              SSM   DFM = p          MSM = SSM/DFM    MSM/MSE
  Error (Residual)   SSE   DFE = n - p - 1  MSE = SSE/DFE
  Total              SST   DFT = n – 1      MST = SST/DFT
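The ANOVA identities above can be checked against the multiple regression output for price on lot, distance, and sqft. The sketch below takes only SSM, SSE, n, and p from that output and rebuilds the remaining summary numbers. The adjusted R-squared line uses the identity 1 - MSE/MST, which is not written on the slide but reproduces the reported value.

```python
from math import sqrt

# Values read from the "regress price lot distance sqft" output
ssm = 11774453      # Model sum of squares
sse = 5912577       # Residual (error) sum of squares
n, p = 33, 3        # observations and number of predictors

sst = ssm + sse                 # SST = SSM + SSE
dfm, dfe, dft = p, n - p - 1, n - 1

msm = ssm / dfm                 # MSM = SSM / DFM
mse = sse / dfe                 # MSE = SSE / DFE
mst = sst / dft                 # MST = SST / DFT

f_stat = msm / mse              # F = MSM / MSE
r_squared = ssm / sst           # share of total variation explained
adj_r_squared = 1 - mse / mst   # penalizes additional predictors
root_mse = sqrt(mse)            # estimate of sigma

print(f"F({dfm}, {dfe}) = {f_stat:.2f}")
print(f"R-squared = {r_squared:.4f}, Adj R-squared = {adj_r_squared:.4f}")
print(f"Root MSE = {root_mse:.2f}")
```

The printed values should agree with the right-hand column of the Stata output up to rounding, which is a quick way to confirm how each entry in that column is derived from the sums of squares.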

Regression with Binary Predictors

In order to compare 3+ means, we learned to model this as a regression with (I-1) binary predictors. So if there are 3 groups, we need to use 2 binary 0/1 predictors in our model. The intercept estimates the reference group’s mean, while the slopes estimate differences between the other groups compared to the reference group.

            |         Summary of price
      Color |       Mean   Std. Dev.   Freq.
------------+----------------------------------
      cream |  1077.6786   523.16194       7
      other |  1373.5312   553.52618      16
      white |  2030.2500   890.44811      10
------------+----------------------------------
      Total |  1509.7803   743.45122      33

  Source |       SS     df       MS        Number of obs =     33
---------+-----------------------         F(  2,    30) =   4.84
   Model |  4312891      2   2156445       Prob > F      = 0.0151
Residual | 13374139     30  445804.7       R-squared     = 0.2438
---------+-----------------------         Adj R-squared = 0.1934
   Total | 17687031     32  552719.7       Root MSE      = 667.69

------------------------------------------------------------
   price |   Coef.   Std.Er.     t    P>|t|   [95% Conf.Int]
---------+--------------------------------------------------
   white |  656.72   269.15    2.44   0.021   107.04   1206.4
   cream | -295.85   302.57   -0.98   0.336   -913.8   322.08
   _cons |  1373.5   166.92    8.23   0.000   1032.6   1714.4
------------------------------------------------------------

[Figure: price plotted by color group (cream, other, white)]

Review Problems

1. For each of the situations described below, select the inference technique (and whether it’s a hypo test or CI) or graphical display that you believe is the most applicable. If it is a statistical hypothesis test, state the null and alternative hypotheses. (Define all terms specific to the example, rather than just giving a response in general terms such as “μ1 = μ2”). Do not go into details of the computations required.

a) You have data on the IQ scores of 100 sets of identical twins who were separated at birth and were raised in high vs. low-income families. You want to test whether household income is associated with outcome of the standardized IQ test.

b) You are interested in estimating the average amount of household consumer debt among a segment of the US population, and you have a random sample from that segment giving consumer debt in dollars per household for 2,000 households. You would like to include a margin of error in your estimate.


c) You have a random sample of 80 of the companies in the Forbes list of the top 500 companies in the US economy. You would like to examine whether assets, sales, number of employees and economic sector (manufacturing, telecommunication, etc) can be used to predict pre-tax profits.

d) For the data you have in part c, you are unsure whether the distribution of pre-tax profits among these companies follows a normal distribution. You would like to examine this assumption.

e) In the United States, there are approximately 900,000 HIV positive individuals among the total population of approximately 293,000,000, so that the population rate of HIV disease is known to be 0.3%. You suspect the rate is higher among some ethnic groups, especially Hispanic males, and you have data on the HIV status from a random sample of 200 Hispanic males.

f) It is commonly thought that hospitals that perform a larger number of surgical procedures of a certain type also have a smaller rate of post-operative complications among patients who receive that surgery. In statistical terms, the number of operations of a given type and the post-operative complication rate are negatively associated. You have data from a random sample of hospitals giving the number of hip replacement surgeries performed in the last year and the rate of post-operative complications from that surgery for that year, and you would like to estimate the association between these variables for this type of operation.

g) You suspect that a small number of respondents to a survey have overstated their income. You are aware that income data tends to be right skewed, so you transform the data by taking the logarithm of each value. You would like to use a graphical display to look for the presence of outliers in the transformed data.


Review Problems (cont.)

2. While sitting outside the Science Center and eavesdropping on cell phone calls, a Stat S-100 student finds that only 11 of 27 men use a curse word during their conversations, while 24 of 33 women do.

a) Calculate the 90% CI for the difference between the sexes in the proportion of people who use curse words on the phone.

b) From this study, is there evidence to suggest there is a difference in frequency of use of curse words between the sexes in the general US population?

c) What may be confounded in this study? Do women really curse more than men?

3. An investigator is trying to determine whether the Red Sox play better or worse when they play in front of large crowds. He decided to collect the following variables on each of the first n = 34 games:
runs_diff – the differential in runs of (Red Sox runs) – (their opponent’s runs)
attendance – the attendance at the game (in thousands of people)
home – a binary variable indicating whether the game was played at home at Fenway (home = 1) or elsewhere (home = 0)

(a) The histogram of the response variable, runs_diff, is shown. Comment on the plot.

[Figure: histogram (density) of runs_diff, spanning roughly -10 to 10]


. regress runs_diff attendance

      Source |       SS     df       MS       Number of obs =      34
-------------+-----------------------        F(  1,    32) =    4.28
       Model |  101.490      1  101.490       Prob > F      =  0.0467
    Residual |  758.892     32  23.7154       R-squared     =  0.1180
-------------+-----------------------        Adj R-squared =  0.0904
       Total |  860.382     33  26.0722       Root MSE      =  4.8698

---------------------------------------------------------------
   runs_diff |   Coef.   Std.Er.     t    P>|t|  [95% Conf.Int]
-------------+-------------------------------------------------
  attendance |  .23321   .11273           0.047   .00358  .46285
       _cons | -7.4958   3.9821           0.069  -15.607  .61554
---------------------------------------------------------------

(b) Does attendance appear to be a significant predictor of score differential?

(c) You decide to attend a Red Sox game where 38 thousand fans attend. Based on the output below and the regression above, give a 95% prediction interval for the score differential for this one game.

. summarize attendance

    Variable |  Obs      Mean   Std. Dev.     Min      Max
-------------+---------------------------------------------
  attendance |   34   34.5374     7.5197    18.65    46.81

. regress runs_diff attendance home

      Source |       SS     df       MS       Number of obs =      34
-------------+-----------------------        F(  2,    31) =    2.47
       Model |  118.261      2  59.1306       Prob > F      =  0.1011
    Residual |  742.121     31  23.9394       R-squared     =  0.1375
-------------+-----------------------        Adj R-squared =  0.0818
       Total |  860.382     33  26.0722       Root MSE      =  4.8928

---------------------------------------------------------------
   runs_diff |   Coef.   Std.Er.     t    P>|t|  [95% Conf.Int]
-------------+-------------------------------------------------
  attendance |  .18756   .12571    1.49   0.146  -.06883  .44396
        home |  1.5590   1.8626    0.84   0.409  -2.2399  5.3579
       _cons | -6.6986   4.1127   -1.63   0.113  -15.086  1.6892
---------------------------------------------------------------

(d) Compare the multiple regression model above to the simple regression model in the previous slide. Is attendance still a significant predictor of Red Sox score differential?

(e) Which model do you think is more accurate? Is there really an attendance effect on Red Sox performance?


4. Below are the summary statistics of the daily high temperature (˚F) in Boston, MA from May 1st until August 2nd (n = 94). These high temperatures were split into days when it rains (precip = 1) vs. days it doesn’t rain (precip = 0).

. tabulate precip, summarize(maxtemp)

            |        Summary of maxtemp
     precip |       Mean   Std. Dev.   Freq.
------------+------------------------------------
          0 |  79.196078   11.568958      51
          1 |   71.44186   12.177853      43
------------+------------------------------------
      Total |  75.648936   12.410287      94

(a) Perform an appropriate hypothesis test to determine whether there is a difference in high temperature on rainy vs. non-rainy days.

(b) The typical daily high temperature for these dates in Boston, MA is 75˚F. Perform an appropriate hypothesis test to determine whether or not we have been lucky and had a hotter average high temperature than normal.

5. An animal-behavior specialist studied tennis legend John McEnroe’s behavior as he won the 1983 Wimbledon tennis tournament. Among the data gathered were whether McEnroe grunted on his serve, and whether his serve was an ace (i.e., a specific way to win the point on the serve), a fault (i.e., error), or other.

(a) Suppose that when McEnroe serves, he remains silent with probability 0.30. Also suppose that whether McEnroe is silent is independent from serve to serve. Find the probability that out of 5 serves McEnroe remains silent at least 3 times.

(b) With the same assumptions as in part (a), calculate the approximate probability that McEnroe is silent at least half the time out of 40 serves.


(c) In the actual tournament, among all his serves that resulted in aces, McEnroe grunted 61 times, and remained silent 35 times. Determine a 95% confidence interval for the true probability that McEnroe grunts when he serves an ace.

(d) The researcher categorized 333 serves by McEnroe during the tournament. The complete data on McEnroe’s grunting behavior while serving is displayed in the table of counts below.

Test at the α = 0.05 level whether McEnroe’s grunting is independent of the outcomes of his serve.

6. The result of a regression with binary predictors is shown below. It is estimating the number of wins in 2011 by NFL teams based on the division they are in (east, north, south, or west).

. regress wins north south west

  Source |     SS     df       MS       Number of obs =      32
---------+----------------------       F(  3,    28) =
   Model |   14.5      3   4.8333       Prob > F      =  0.7357
Residual |  317.5     28   11.340       R-squared     =  0.0437
---------+----------------------       Adj R-squared = -0.0588
   Total |    332     31   10.710       Root MSE      =  3.3674

---------------------------------------------------------------
    wins |  Coef.   Std.Err.     t    P>|t|   [95% Conf. Int.]
---------+-----------------------------------------------------
   north |   1.25    1.6837    0.74   0.464   -2.1989   4.6989
   south |    -.5    1.6837   -0.30   0.769   -3.9489   2.9489
    west |   -.25    1.6837   -0.15   0.883   -3.6989   3.1989
   _cons |  7.875    1.1906    6.61   0.000    5.4363   10.314
---------------------------------------------------------------

a) Based on the above model, how many wins are teams from the south estimated to have?
b) Which division had the highest mean number of wins?
c) Is there a statistically significant difference in mean number of wins among these 4 groups? (Be sure to calculate the correct statistic.)
