Correlation and Linear Regression


Correlation

Linear Regression

Truong Phuoc Long, Ph.D.

1
Causation vs. Association

• The health sciences are concerned with the determinants of health,
  i.e. what makes us healthy or unhealthy (cause and effect).
• Causation implies that A and B have a cause-and-effect relationship
  with one another.
• It is very difficult to conclude that a relationship is causal.

2
Causation vs. Association (cont.)
• Hill’s criteria of causation (proposed by A. B. Hill)
- Strength of the association: strong/weak associations.
- Consistency of findings: refers to the repeated observation of an
association in different populations, different investigators,
different methods, etc.
- Specificity of the association: requires that a cause leads to a
single effect, not multiple effects. However, a single cause often
leads to multiple effects; smoking is a prime example.
- Temporal relationship: Exposure always precedes the outcome.
First exposure, then disease.

3
Causation vs. Association (cont.)
• Hill’s criteria of causation (cont.)
- Biological plausibility: the hypothesized mechanism is biologically
plausible.
 The finding is consistent with existing biological and medical
knowledge.
- Dose-Response Relationship: incremental change in disease rates in
conjunction with corresponding changes in exposure.
- Consideration of Alternate Explanations
- Coherence: the causal interpretation fits with known facts about the
natural history/biology of the disease.
 Experimental evidence demonstrating that, under controlled conditions,
changing the exposure causes a change in the outcome is of great value.
4
Association  Causation

For there to be causation, there must first be an association.


So far in this course, we have tried to show evidence of an
association (a relationship) between two variables.

Dependent variable               Independent variable
(aka outcome, response)          (aka predictor, explanatory variable)
Cholesterol                      Doing exercise
Head injury                      Wearing a helmet
Tuberculosis                     Sharing needles
Cervical cancer                  Smoking

We haven’t dealt with the strength of association or dose-response.


5
Correlation analysis
• Investigates the relationship between two continuous variables
(continuously measured, no gaps).
• Concerned with measuring the strength and direction of the
association between variables. The correlation of X and Y (Y and X).
• Examples:
- Age & systolic blood pressure.
- Percentage of adults who have been immunized against Covid-19 and
the corresponding mortality rate.
- Total lung capacity (l) and height (m).
- Systolic BP (mm Hg) and salt intake (g)
In each example, which is the independent variable and which is the dependent variable?
6
Scatter plot of the 2 variables

[Scatter plots: linear relationships vs. non-linear relationships]

Before we conduct any type of analysis, we should always create a
two-way scatter plot of the data.
7
Scatter plot of the 2 variables

[Scatter plots: strong relationships vs. weak relationships]
8
Scatter plot of the 2 variables

[Scatter plot: no relationship]
9
Correlation Analysis
Some specific names for “correlation” in one’s data:
• The population correlation coefficient: ρ (rho).
• The sample correlation coefficient: r
• Range from -1 to 1, unit free.
• The value of r can be substantially influenced by a small
fraction of outliers.

10
Correlation Analysis

• Measures the strength of the association (linear relationship)


between two (continuous) variables and the direction.
– How closely does a scatterplot of the two variables produce a
non-flat straight line?
• Exactly: r = 1 or -1
• Not at all (e.g. flat relationship): r = 0
• -1 ≤ r ≤ 1
– Does one variable tend to increase as the other increases (r > 0),
or decrease as the other increases (r < 0)

11
Correlation coefficient

12
Pearson’s Correlation Coefficient

Sample Pearson’s correlation coefficient (correlation coefficient):

      r = Σ(xi - x̄)(yi - ȳ) / √( Σ(xi - x̄)² · Σ(yi - ȳ)² )

or the algebraic equivalent:

      r = ( nΣxy - ΣxΣy ) / √( ( nΣx² - (Σx)² ) · ( nΣy² - (Σy)² ) )

where:
r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable
13
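The coefficient defined above can be computed directly from its definition. A minimal sketch (the helper name `pearson_r` is my own, not from the slides):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the sample means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear increasing data gives r = 1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```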
Map r value with its scatter plot

[Five scatter plots, labelled (A)-(E), to be matched with:]
1. r = 0.3
2. r = 1
3. r = -0.6
4. r = 0
5. r = -1
14
Examples of approximate r values

[Five scatter plots showing, from left to right:
 r = -1, r = -0.6, r = 0, r = +0.3, r = +1]
15
Significance Test for Correlation

• Hypotheses
      H0: ρ = 0 (no correlation)
      HA: ρ ≠ 0 (correlation exists)
What do the hypotheses mean in words?
- Null hypothesis: the population correlation coefficient is not
  significantly different from zero; there is no significant linear
  relationship (correlation) between x and y in the population.
- Alternative hypothesis: the population correlation coefficient is
  significantly different from zero; there is a significant linear
  relationship (correlation) between x and y in the population.

• Test statistic (compare to a t distribution with df = n - 2):

      t = r / √( (1 - r²) / (n - 2) )

16
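The test statistic can be sketched in a few lines (the helper name `correlation_t` is my own; it simply applies the formula above):

```python
import math

def correlation_t(r, n):
    """t statistic for testing H0: rho = 0; compare to t with n - 2 df."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# With r = 0.886 and n = 8 (the values used in the worked example)
print(round(correlation_t(0.886, 8), 2))  # 4.68
```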
Example: r = ?

17
Example

18
Example

Is there evidence of a linear relationship between y and x
at the 0.05 level of significance?

H0: ?
HA: ?
α = 0.05, df = n - 2, r = 0.886, n = 8

      t = r / √( (1 - r²) / (n - 2) )

19
Example

Is there evidence of a linear relationship between y and x
at the 0.05 level of significance?

H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)
α = 0.05, df = 8 - 2 = 6

      t = r / √( (1 - r²) / (n - 2) )
        = 0.886 / √( (1 - 0.886²) / (8 - 2) )
        = 4.68

20
Example

      t = 0.886 / √( (1 - 0.886²) / (8 - 2) ) = 4.68

d.f. = 8 - 2 = 6; α/2 = .025; critical values -tα/2 = -2.4469 and
tα/2 = 2.4469

[t-distribution figure: rejection regions in both tails beyond ±2.4469;
the observed t = 4.68 falls in the upper rejection region]

Decision: Reject H0.
Conclusion: There is evidence of a linear relationship at the 5% level
of significance.
22
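The arithmetic on this slide can be verified directly (the critical value 2.4469 is taken from the slide, for df = 6 and a two-sided α of 0.05):

```python
import math

r, n, t_crit = 0.886, 8, 2.4469  # values from the slide
t = r / math.sqrt((1 - r ** 2) / (n - 2))
print(round(t, 2))       # 4.68
print(abs(t) > t_crit)   # True, so we reject H0
```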
Regression models
• Regression models describe the relationship between
variables by fitting a line to the observed data.
• Linear regression models use a straight line, while logistic
and nonlinear regression models use a curved line.
• Regression allows you to estimate how a dependent
variable changes as the independent variable(s) change.

23
Simple linear regression
• A technique to explore the nature of the relationship between two
continuous variables.

• E.g., is there a relationship between education level and income?

• Linear regression: Concerned with predicting the value of one


variable based on (given) the value of the other variable. The
regression of Y on X.

• Dependent variable: the variable for which we want to make a


prediction.

• Independent variable: the variable used to explain the dependent


variable.

24
Simple linear regression model

• Only one independent variable, X


• Relationship between X and Y is described by a linear
function.
• Changes in Y are assumed to be caused by changes in X

25
Linear regression

• Aims to predict the value of a health outcome, Y, based on the value


of an explanatory variable, X.

– What is the relationship between average Y and X?


• The analysis “models” this as a line
• We care about “slope”—size, direction
• Slope = 0 corresponds to “no association”

– How precisely can we predict a given person’s Y with his/her X?

26
Example of height and weight

• Look at this
scatter plot, what
do you think
about the trend
between height
and weight?
• What is the best
way to show this
trend?

27
How about a line?

 Which line?

28
The best fitting line

• The line closest to


all dots.
• How can we find
that line?
• Recall from
algebra, a line is
defined by:
y = ax + b
a: slope
b: intercept
29
Conditional population & conditional distribution

 Do all people of the same height have the same weight?

30
Conditional population & conditional distribution

• A conditional population of Y values is associated with a fixed, or
given, value of X.
  • e.g. the weights of all people having the same height (their
    weights still differ).
• A conditional distribution is the distribution of values within the
conditional population above.

31
The Linear Model

 Assumptions:
  Linearity
  Constant standard deviation σY|X

      y = β0 + β1x + ε

Where:
- y is the dependent or response variable.
- x is the independent or predictor variable.
- ε is the error term in the model.
32
Population Linear Regression

The population regression model:

      y = β0 + β1x + ε

33
Population Linear Regression (continued)

      y = β0 + β1x + ε

[Figure: for a given xi, the observed value of y differs from the
predicted value on the line by the random error εi; the line has
intercept β0 and slope β1]

34
Linear Regression Assumptions

• The underlying relationship between the x variable


and the y variable is linear.
• Error values (ε) are statistically independent.
• Error values are normally distributed for any given
value of x.
• The probability distribution of the errors has constant
variance.

35
Ordinary least squares - OLS
Linear regression model:

      y = β0 + β1x + ε

Linear regression parameter estimation:

      ŷi = β̂0 + β̂1xi

 The difference between the observed yi and the predicted ŷi is
called the prediction error or the residual:
      ei = yi - ŷi
 Ordinary least squares (OLS) is a method to find the best-fitting
line for the data (find β0, β1) by minimizing the sum of the squares
of the residuals.

36
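The OLS idea can be sketched from the standard closed-form solution (the helper name `ols_fit` is my own; it computes the usual slope and intercept estimates):

```python
def ols_fit(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    # Intercept: the fitted line passes through the point of means
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Data lying exactly on y = 1 + 2x gives zero residuals
print(ols_fit([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```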
Regression Picture

[Figure: scatter of (x, y) with the fitted line ŷi = β̂0 + β̂1xi; for each
observation, A = yi - ȳ is the total deviation, B = ŷi - ȳ is the
deviation explained by the line, and C = yi - ŷi is the residual.
Least squares estimation gave us the line (β̂) that minimized ΣC²]

      Σ(yi - ȳ)²  =  Σ(ŷi - ȳ)²  +  Σ(ŷi - yi)²
        SStotal         SSreg          SSresidual

- SStotal (A²): total squared distance of observations from the naïve
  mean of y; the total variation.
- SSreg (B²): distance from the regression line to the naïve mean of y;
  the variability due to x (regression).
- SSresidual (C²): variance around the regression line; additional
  variability not explained by x, which the least squares method aims
  to minimize.

      R² = SSreg / SStotal
37
Ordinary least squares - OLS

 Why squares?
 Because the positive and negative residuals cancel out when simply
summed (their sum is zero for the best-fitting line, which is how we
know our line is closest to all the dots), so we minimize the sum of
their squares instead.

38
Ordinary least squares - OLS
 Solving this minimization problem gives the least-squares estimates:

      β̂1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
      β̂0 = ȳ - β̂1·x̄

39
The estimated linear regression

• yi: value of the dependent variable (weight) for observation i
  (student i)
• xi: value of the independent variable (height) for observation i
• ŷi: predicted value of yi (from the best-fitting line)
• The linear regression line: the best-fitting line

40
Simple linear regression

• Example: what is the regression line for the effect


of height on weight?

41
ŷ = -59.026 + 0.710x

42
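The fitted line above can be used for prediction. A minimal sketch (the helper name `predict_weight` is my own, and the units, cm for height and kg for weight, are my assumption about the scatter plot):

```python
def predict_weight(height_cm):
    """Fitted line from the slide: y-hat = -59.026 + 0.710 * x.
    Units (cm for height, kg for weight) are an assumption."""
    return -59.026 + 0.710 * height_cm

print(round(predict_weight(170), 1))  # 61.7
```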
The coefficient of determination R²

 R² is the square of the Pearson correlation coefficient (r).

 R² shows the percentage of the variability among the observed
values of y that is explained by the linear relationship between
X and Y.

 The relationship between the coefficient of determination and r is:

      R² = r²
43
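A minimal sketch of this decomposition, using the sums of squares defined earlier (the helper name `r_squared` is my own); for simple linear regression the result equals the squared Pearson correlation:

```python
def r_squared(x, y):
    """R^2 = SSreg / SStotal for a simple linear least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Least-squares slope and intercept
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)          # total variation
    ss_reg = sum((b0 + b1 * xi - mean_y) ** 2 for xi in x)  # explained by x
    return ss_reg / ss_total

x, y = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
# For these data r = 0.8, so R^2 should be 0.64
print(round(r_squared(x, y), 2))  # 0.64
```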
Interpretation

44
Note

45
Example: Y: arm circumference; X: height
ŷ is the average arm circumference for a group of children all of the
same height, x

46
