Correlation and Linear Regression


Correlation

Linear Regression

Truong Phuoc Long, Ph.D.

1
Causation vs. Association

• The health sciences are concerned with the determinants of health,
  i.e. what makes us healthy or unhealthy (cause and effect).
• Causation implies that A and B have a cause-and-effect relationship
  with one another.
• It is very difficult to conclude that a relationship is causal.

2
Causation vs. Association (cont.)
• Hill’s criteria of causation (proposed by A. B. Hill)
- Strength of the association: strong/weak associations.
- Consistency of findings: refers to the repeated observation of an
association in different populations, different investigators,
different methods, etc.
- Specificity of the association: requires that a cause leads to a
single effect, not multiple effects. However, a single cause often
leads to multiple effects; smoking is a prime example.
- Temporal relationship: Exposure always precedes the outcome.
First exposure, then disease.

3
Causation vs. Association (cont.)
• Hill’s criteria of causation (cont.)
- Biological plausibility: the hypothesized mechanism is biologically
plausible.
 The finding is consistent with existing biological and medical
knowledge.
- Dose-Response Relationship: incremental change in disease rates in
conjunction with corresponding changes in exposure.
- Consideration of Alternate Explanations
- Coherence: the causal interpretation fits with known facts about the
natural history/biology of the disease.
 Experimental evidence demonstrating that, under controlled conditions,
changing the exposure causes a change in the outcome is of great value.
4
Association  Causation

For there to be causation, there must first be an association.


So far in this course, we have tried to show evidence of an
association (a relationship) between two variables.

Dependent variable               Independent variable
(aka outcome, response)          (aka predictor, explanatory variable)
Cholesterol                      Doing exercise
Head injury                      Wearing a helmet
Tuberculosis                     Sharing needles
Cervical cancer                  Smoking

We haven’t dealt with the strength of association or dose-response.


5
Correlation analysis
• Investigates the relationship between two continuous variables
(continuously measured, no gaps).
• Concerned with measuring the strength and direction of the
association between variables. The correlation of X and Y (Y and X).
• Examples:
- Age & systolic blood pressure.
- Percentage of adults who have been immunized against Covid-19 and
the corresponding mortality rate.
- Total lung capacity (l) and height (m).
- Systolic BP (mm Hg) and salt intake (g)
In each example, which is the independent variable and which is the dependent variable?
6
Scatter plot of the 2 variables

[Scatter plots: linear relationships vs. non-linear relationships]

Before we conduct any type of analysis, we should always create a
two-way scatter plot of the data.
7
Scatter plot of the 2 variables

[Scatter plots: strong relationships vs. weak relationships]
8
Scatter plot of the 2 variables

[Scatter plot: no relationship]
9
Correlation Analysis
Some specific names for “correlation” in one’s data:
• The population correlation coefficient: ρ (rho).
• The sample correlation coefficient: r
• Range from -1 to 1, unit free.
• The value of r can be substantially influenced by a small
fraction of outliers.

10
Correlation Analysis

• Measures the strength of the association (linear relationship)


between two (continuous) variables and the direction.
– How closely does a scatterplot of the two variables produce a
non-flat straight line?
• Exactly: r = 1 or -1
• Not at all (e.g. flat relationship): r = 0
• -1 ≤ r ≤ 1
– Does one variable tend to increase as the other increases (r > 0),
or decrease as the other increases (r < 0)

11
Correlation coefficient

12
Pearson’s Correlation Coefficient

Sample Pearson’s correlation coefficient (correlation coefficient):

      r = Σ(xi - x̄)(yi - ȳ) / √( Σ(xi - x̄)² · Σ(yi - ȳ)² )

or the algebraic equivalent:

      r = ( nΣxy - ΣxΣy ) / √( ( nΣx² - (Σx)² ) · ( nΣy² - (Σy)² ) )

where:
r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable
13
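The coefficient defined above can be computed directly from its definition. A minimal sketch (the helper name `pearson_r` is my own, not from the slides):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the sample means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear increasing data gives r = 1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```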
Map r value with its scatter plot

[Five scatter plots, labelled (A)-(E), to be matched with:]
1. r = 0.3
2. r = 1
3. r = -0.6
4. r = 0
5. r = -1
14
Examples of approximate r values

[Five scatter plots showing, from left to right:
 r = -1, r = -0.6, r = 0, r = +0.3, r = +1]
15
Significance Test for Correlation

• Hypotheses
      H0: ρ = 0 (no correlation)
      HA: ρ ≠ 0 (correlation exists)
What do the hypotheses mean in words?
- Null hypothesis: the population correlation coefficient is not
  significantly different from zero; there is no significant linear
  relationship (correlation) between x and y in the population.
- Alternative hypothesis: the population correlation coefficient is
  significantly different from zero; there is a significant linear
  relationship (correlation) between x and y in the population.

• Test statistic (compare to a t distribution with df = n - 2):

      t = r / √( (1 - r²) / (n - 2) )

16
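The test statistic can be sketched in a few lines (the helper name `correlation_t` is my own; it simply applies the formula above):

```python
import math

def correlation_t(r, n):
    """t statistic for testing H0: rho = 0; compare to t with n - 2 df."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# With r = 0.886 and n = 8 (the values used in the worked example)
print(round(correlation_t(0.886, 8), 2))  # 4.68
```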
Example: r = ?

17
Example

18
Example

Is there evidence of a linear relationship between y and x
at the 0.05 level of significance?

H0: ?
HA: ?
α = 0.05, df = n - 2, r = 0.886, n = 8

      t = r / √( (1 - r²) / (n - 2) )

19
Example

Is there evidence of a linear relationship between y and x
at the 0.05 level of significance?

H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)
α = 0.05, df = 8 - 2 = 6

      t = r / √( (1 - r²) / (n - 2) )
        = 0.886 / √( (1 - 0.886²) / (8 - 2) )
        = 4.68

20
Example

      t = 0.886 / √( (1 - 0.886²) / (8 - 2) ) = 4.68

d.f. = 8 - 2 = 6; α/2 = .025; critical values -tα/2 = -2.4469 and
tα/2 = 2.4469

[t-distribution figure: rejection regions in both tails beyond ±2.4469;
the observed t = 4.68 falls in the upper rejection region]

Decision: Reject H0.
Conclusion: There is evidence of a linear relationship at the 5% level
of significance.
22
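The arithmetic on this slide can be verified directly (the critical value 2.4469 is taken from the slide, for df = 6 and a two-sided α of 0.05):

```python
import math

r, n, t_crit = 0.886, 8, 2.4469  # values from the slide
t = r / math.sqrt((1 - r ** 2) / (n - 2))
print(round(t, 2))       # 4.68
print(abs(t) > t_crit)   # True, so we reject H0
```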
Regression models
• Regression models describe the relationship between
variables by fitting a line to the observed data.
• Linear regression models use a straight line, while logistic
and nonlinear regression models use a curved line.
• Regression allows you to estimate how a dependent
variable changes as the independent variable(s) change.

23
Simple linear regression
• A technique to explore the nature of the relationship between two
continuous variables.

• E.g., is there a relationship between education level and income?

• Linear regression: Concerned with predicting the value of one


variable based on (given) the value of the other variable. The
regression of Y on X.

• Dependent variable: the variable for which we want to make a


prediction.

• Independent variable: the variable used to explain the dependent


variable.

24
Simple linear regression model

• Only one independent variable, X


• Relationship between X and Y is described by a linear
function.
• Changes in Y are assumed to be caused by changes in X

25
Linear regression

• Aims to predict the value of a health outcome, Y, based on the value


of an explanatory variable, X.

– What is the relationship between average Y and X?


• The analysis “models” this as a line
• We care about “slope”—size, direction
• Slope = 0 corresponds to “no association”

– How precisely can we predict a given person’s Y with his/her X?

26
Example of height and weight

• Look at this
scatter plot, what
do you think
about the trend
between height
and weight?
• What is the best
way to show this
trend?

27
How about a line?

 Which line?

28
The best fitting line

• The line closest to


all dots.
• How can we find
that line?
• Recall from
algebra, a line is
defined by:
y = ax + b
a: slope
b: intercept
29
Conditional population & conditional distribution

 Do all people of the same height have the same weight?

30
Conditional population & conditional distribution

• A conditional population of Y values is associated with a fixed, or
given, value of X.
  • e.g. the weights of all people having the same height (their
    weights still differ).
• A conditional distribution is the distribution of values within the
conditional population above.

31
The Linear Model

 Assumptions:
  Linearity
  Constant standard deviation σY|X

      y = β0 + β1x + ε

Where:
- y is the dependent or response variable.
- x is the independent or predictor variable.
- ε is the error term in the model.
32
Population Linear Regression

The population regression model:

      y = β0 + β1x + ε

33
Population Linear Regression (continued)

      y = β0 + β1x + ε

[Figure: for a given xi, the observed value of y differs from the
predicted value on the line by the random error εi; the line has
intercept β0 and slope β1]

34
Linear Regression Assumptions

• The underlying relationship between the x variable


and the y variable is linear.
• Error values (ε) are statistically independent.
• Error values are normally distributed for any given
value of x.
• The probability distribution of the errors has constant
variance.

35
Ordinary least squares - OLS
Linear regression model:

      y = β0 + β1x + ε

Linear regression parameter estimation:

      ŷi = β̂0 + β̂1xi

 The difference between the observed yi and the predicted ŷi is
called the prediction error or the residual:
      ei = yi - ŷi
 Ordinary least squares (OLS) is a method to find the best-fitting
line for the data (find β0, β1) by minimizing the sum of the squares
of the residuals.

36
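The OLS idea can be sketched from the standard closed-form solution (the helper name `ols_fit` is my own; it computes the usual slope and intercept estimates):

```python
def ols_fit(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    # Intercept: the fitted line passes through the point of means
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Data lying exactly on y = 1 + 2x gives zero residuals
print(ols_fit([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```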
Regression Picture

[Figure: scatter of (x, y) with the fitted line ŷi = β̂0 + β̂1xi; for each
observation, A = yi - ȳ is the total deviation, B = ŷi - ȳ is the
deviation explained by the line, and C = yi - ŷi is the residual.
Least squares estimation gave us the line (β̂) that minimized ΣC²]

      Σ(yi - ȳ)²  =  Σ(ŷi - ȳ)²  +  Σ(ŷi - yi)²
        SStotal         SSreg          SSresidual

- SStotal (A²): total squared distance of observations from the naïve
  mean of y; the total variation.
- SSreg (B²): distance from the regression line to the naïve mean of y;
  the variability due to x (regression).
- SSresidual (C²): variance around the regression line; additional
  variability not explained by x, which the least squares method aims
  to minimize.

      R² = SSreg / SStotal
37
Ordinary least squares - OLS

 Why squares?
 Because the positive and negative residuals cancel out when simply
summed (their sum is zero for the best-fitting line, which is how we
know our line is closest to all the dots), so we minimize the sum of
their squares instead.

38
Ordinary least squares - OLS
 Solving this minimization problem gives the least-squares estimates:

      β̂1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
      β̂0 = ȳ - β̂1·x̄

39
The estimated linear regression

• yi: value of the dependent variable (weight) for observation i
  (student i)
• xi: value of the independent variable (height) for observation i
• ŷi: predicted value of yi (from the best-fitting line)
• The linear regression line: the best-fitting line

40
Simple linear regression

• Example: what is the regression line for the effect


of height on weight?

41
ŷ = -59.026 + 0.710x

42
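The fitted line above can be used for prediction. A minimal sketch (the helper name `predict_weight` is my own, and the units, cm for height and kg for weight, are my assumption about the scatter plot):

```python
def predict_weight(height_cm):
    """Fitted line from the slide: y-hat = -59.026 + 0.710 * x.
    Units (cm for height, kg for weight) are an assumption."""
    return -59.026 + 0.710 * height_cm

print(round(predict_weight(170), 1))  # 61.7
```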
The coefficient of determination R²

 R² is the square of the Pearson correlation coefficient (r).

 R² shows the percentage of the variability among the observed
values of y that is explained by the linear relationship between
X and Y.

 The relationship between the coefficient of determination and r is:

      R² = r²
43
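A minimal sketch of this decomposition, using the sums of squares defined earlier (the helper name `r_squared` is my own); for simple linear regression the result equals the squared Pearson correlation:

```python
def r_squared(x, y):
    """R^2 = SSreg / SStotal for a simple linear least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Least-squares slope and intercept
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)          # total variation
    ss_reg = sum((b0 + b1 * xi - mean_y) ** 2 for xi in x)  # explained by x
    return ss_reg / ss_total

x, y = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
# For these data r = 0.8, so R^2 should be 0.64
print(round(r_squared(x, y), 2))  # 0.64
```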
Interpretation

44
Note

45
Example: Y: arm circumference; X: height
ŷ is the average arm circumference for a group of children all of the
same height, x

46
