Correlation and Regression: General Principles of Data Analysis
Regression
Scatterplots
Correlation
Explanatory and response variables
Simple linear regression
General Principles of Data Analysis
Plot your data
To understand the data, always start with a series of graphs
Bivariate Data Analysis
Plot your data
For two quantitative variables, use a scatterplot
Scatterplot
[Figure: scatterplot of heart disease deaths (per 100,000 people) against alcohol with wine (liters per person per year)]
Using and Interpreting Correlation
Ranges from -1 to 1; values closer to -1 or 1 indicate a stronger linear relationship
Positive values indicate positive association; negative values indicate negative association
Does not distinguish between explanatory and response variables
Requires that both variables be quantitative
Has no unit of measurement: because it uses standardized values, it is scale free
Measures the strength of linear relationships only
Like the mean and standard deviation, it is strongly affected by outlying observations
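The properties above can be checked numerically. The following is a minimal Python sketch (not Stata), assuming the usual definition of r as the sum of products of standardized values divided by n - 1. The data are made-up illustrative numbers, not the wine data, and the function name correlation is hypothetical.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Pearson r: sum of products of standardized values, divided by n - 1."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)  # sample standard deviations
    return sum(((xi - mx) / sx) * ((yi - my) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

# Made-up illustrative values (assumption), not the data from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 8.5, 6.0, 5.5, 3.0]

r = correlation(x, y)
r_swapped = correlation(y, x)                      # roles of x and y exchanged
r_rescaled = correlation([10 * v for v in x], y)   # x measured in other units
r_outlier = correlation(x, y[:-1] + [30.0])        # last point made an outlier

print(r, r_swapped, r_rescaled, r_outlier)
```

Swapping the variables or changing the units leaves r unchanged, while replacing a single observation with an outlier changes it substantially.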
[Figure: scatterplot of heart disease death rate (per 100,000) against alcohol with wine (liters per person per year), with the least-squares line]
Vertical distances from the least-squares line are residuals
The least-squares line makes these distances small
In Stata, obtain this graph with twoway (lfit heartdis alcohol) (scatter heartdis alcohol)
Regression in Stata
regress heartdis alcohol

      Source |       SS       df       MS              Number of obs =      19
-------------+------------------------------           F(  1,    17) =   41.69
       Model |  59813.5718     1  59813.5718           Prob > F      =  0.0000
    Residual |  24391.3756    17   1434.7868           R-squared     =  0.7103
-------------+------------------------------           Adj R-squared =  0.6933
       Total |  84204.9474    18  4678.05263           Root MSE      =  37.879

------------------------------------------------------------------------------
    heartdis |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     alcohol |  -22.96877    3.55739    -6.46   0.000     -30.4742   -15.46333
       _cons |   260.5634   13.83536    18.83   0.000     231.3733    289.7534
------------------------------------------------------------------------------

slope, b: the alcohol coefficient (-22.96877)
y-intercept, a: the _cons coefficient (260.5634)
ŷ = a + bx
Estimated heart disease death rate = 260.56 + (-22.97)(Per capita alcohol consumption)
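The summary statistics in the Stata output follow from the sums of squares and coefficients by simple arithmetic. The Python sketch below (not Stata) recomputes them from the reported values as a cross-check. The critical value 2.1098 (t distribution, 17 degrees of freedom) is taken from a standard t table and is an assumption not shown in the output; predicted_death_rate is a hypothetical helper name.

```python
import math

# Values copied from the regress output.
ss_model, ss_resid, ss_total = 59813.5718, 24391.3756, 84204.9474
df_model, df_resid = 1, 17
a, b = 260.5634, -22.96877      # _cons (intercept) and alcohol (slope)
se_b = 3.55739                  # standard error of the slope

r_squared = ss_model / ss_total                  # share of variation explained
ms_resid = ss_resid / df_resid                   # residual mean square
f_stat = (ss_model / df_model) / ms_resid        # overall F statistic
root_mse = math.sqrt(ms_resid)                   # residual standard deviation
t_slope = b / se_b                               # t statistic for the slope
adj_r_squared = 1 - (1 - r_squared) * (df_model + df_resid) / df_resid
r = math.copysign(math.sqrt(r_squared), b)       # correlation has the slope's sign

T_CRIT = 2.1098  # assumed: 97.5th percentile of t with 17 df, from a t table
slope_ci = (b - T_CRIT * se_b, b + T_CRIT * se_b)

def predicted_death_rate(alcohol):
    """Fitted line: y-hat = a + b * x."""
    return a + b * alcohol

print(r_squared, f_stat, root_mse, t_slope, adj_r_squared, r, slope_ci)
print(predicted_death_rate(2.0))
```

With one explanatory variable, the F statistic equals the square of the slope's t statistic, and r is the square root of R-squared with the sign of the slope.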
Using and Interpreting Regression
The distinction between explanatory and response variables is essential in regression
Regression of y on x ≠ regression of x on y
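A small numerical illustration of why the two regressions differ, as a Python sketch (not Stata). The data are made-up numbers, not the wine data, and least_squares is a hypothetical helper name.

```python
from statistics import mean

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up illustrative values (assumption), not the data from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 8.5, 6.0, 5.5, 3.0]

a_yx, b_yx = least_squares(x, y)   # regression of y on x
a_xy, b_xy = least_squares(y, x)   # regression of x on y

# Rearranging x = a_xy + b_xy * y to solve for y gives a line with slope 1 / b_xy.
implied_slope = 1 / b_xy

print(b_yx, implied_slope, b_yx * b_xy)
```

The two lines coincide only when the points fall exactly on a line; in general the product of the two slopes equals r squared, which is below 1.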
Residuals-versus-Fitted Plot
[Figure: residuals (vertical axis, about -50 to 50) plotted against fitted values]
Residuals-versus-Predictor Plot
[Figure: residuals (vertical axis, about -50 to 50) plotted against alcohol with wine (liters per person per year)]
In Stata, obtain this plot after regress with rvpplot alcohol, yline(0)
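The quantities shown in residual plots can be sketched in Python (not Stata) as follows. The data are made-up numbers, not the wine data. A residual is the observed value minus the fitted value; for a least-squares line with an intercept, the residuals sum to zero and are uncorrelated with the predictor, which is why a horizontal reference line at 0 is the natural baseline.

```python
from statistics import mean

# Made-up illustrative values (assumption), not the data from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 8.5, 6.0, 5.5, 3.0]

# Least-squares slope and intercept.
mx, my = mean(x), mean(y)
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))
a = my - b * mx

fitted = [a + b * xi for xi in x]                      # y-hat for each observation
residuals = [yi - fi for yi, fi in zip(y, fitted)]     # observed minus fitted

# Pairs to plot: residual vs. fitted value, and residual vs. predictor.
for xi, fi, ri in zip(x, fitted, residuals):
    print(f"x = {xi:4.1f}  fitted = {fi:6.2f}  residual = {ri:6.2f}")
```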
95% Confidence Interval for Least-Squares Line
[Figure: scatterplot of heart disease deaths per 100,000 people against alcohol with wine (liters per person per year), with the least-squares line and its 95% confidence band]
Obtain this graph with twoway (lfitci heartdis alcohol) (scatter heartdis alcohol)
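A confidence band of this kind can be sketched in Python (not Stata), assuming the band is for the mean response at each x, computed as ŷ ± t* · s · sqrt(1/n + (x - x̄)² / Sxx), where s is the root MSE. The data are made-up numbers, not the wine data; the critical value 3.182 (t distribution, 3 degrees of freedom) is taken from a standard t table; mean_response_ci is a hypothetical helper name.

```python
import math
from statistics import mean

# Made-up illustrative values (assumption), not the data from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 8.5, 6.0, 5.5, 3.0]

n = len(x)
mx, my = mean(x), mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))     # root MSE, with n - 2 degrees of freedom

T_CRIT = 3.182  # assumed: 97.5th percentile of t with n - 2 = 3 df, from a t table

def mean_response_ci(x0):
    """Return (lower, fit, upper) of the 95% CI for the mean response at x0."""
    fit = a + b * x0
    half_width = T_CRIT * s * math.sqrt(1 / n + (x0 - mx) ** 2 / sxx)
    return fit - half_width, fit, fit + half_width

for x0 in (1.0, 3.0, 5.0):
    lo, fit, hi = mean_response_ci(x0)
    print(f"x = {x0}: fit {fit:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

The band is narrowest at the mean of x and widens toward the ends of the data, which gives the curved shape of the shaded region around the fitted line.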