Chapter 4-Correlation and Regression
By
Abebe Feyissa
(PhD Candidate, Epidemiology & Biostatistics)
2 Content
Objectives of this chapter
Part I: General introduction to correlation and
regression analysis
Part II: Correlation analysis
Part III: Regression analysis
3 Objectives of this Chapter
After studying this topic, the student will be able to:
1. Explain when to use correlation and regression techniques to answer research questions or
test hypotheses
2. Understand the assumptions behind correlation and regression analysis
3. Understand the principles and procedures of Pearson correlation and regression coefficients
4. Determine whether Pearson correlation and regression coefficients are statistically
significant
5. Use SPSS to compute the Pearson correlation and regression coefficients and correctly
interpret the outputs
6. Report correlation and regression coefficients in terms of the direction and strength of
association and their statistical significance
4 Part I: Introduction
5 Introduction
Variable classification
Classifying variables helps in selecting which data analysis model to use
There are three methods of variable classification
1. Gap (continuous vs discrete)
2. Descriptive (predictive property): dependent vs independent
3. Level of measurement (nominal, ordinal, interval, ratio)
6 Introduction…
Two commonly used correlation coefficients are the Pearson and the Spearman coefficients. Both measure association
The Pearson correlation coefficient is used when both variables are normally
distributed
It is a parametric test that measures the association of two variables at the
interval or ratio scale
The Spearman correlation coefficient is used when one or both of the variables
are not normally distributed
It is a nonparametric test that measures the association of two variables at the
ordinal, interval, or ratio scale
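The difference between the two coefficients can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical, the helper names (`pearson_r`, `ranks`, `spearman_rho`) are made up for this sketch, and Spearman's coefficient is computed as the Pearson coefficient of the ranks.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks from 1 to n; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: y always increases with x, but not along a straight line
x = [1, 2, 3, 4, 5, 6]
y = [1, 8, 27, 64, 125, 216]  # y = x ** 3

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because the relationship is monotonic but curved, the rank-based coefficient reaches 1 while the Pearson coefficient stays below 1.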
13 Introduction to Correlation Analysis…
Stating the hypotheses
As with most statistical tests, we first have to state the null and alternative
hypotheses
Example
Within a population of diabetic patients, two variables have been measured:
Fasting blood glucose (mmol/L)
Mean circumferential shortening velocity of the left ventricle (Vcf)
(%/sec)
We want to study the relation between these two variables
15 Introduction to Correlation Analysis…
Example...
What is the relationship between fasting blood glucose and mean circumferential
shortening velocity of the left ventricle?
Graphically (Scatter plot)
Numerically
Correlation (Correlation coefficient)
Linear regression (Regression coefficient)
How do you measure the strength of the relationship?
Correlation coefficient or regression coefficient
16 Assumptions of Pearson correlation
1. The two variables are on either the interval or ratio measurement scales
2. The two variables are normally distributed
3. The two variables are related to each other in a linear fashion (straight
line)
4. There are no outliers (observations that fall outside of the pattern of the rest of
the data)
17 When can a Pearson correlation coefficient be used?
For each value of one variable, the distribution of the other variable is normal
For every value of the first measure (X), the distribution of the second
measure (Y) must have equal variance, and for every value of Y, the
distribution of X must have equal variance (This is called the assumption of
homoscedasticity)
19 Scatter plot
[Figure: scatter plot with Vcf on the vertical axis]
$$ r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \, \sum_i (Y_i - \bar{Y})^2}} $$
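The formula for r can be followed step by step in a short Python sketch. The glucose and Vcf values below are hypothetical numbers invented for illustration, not data from the study in the example.

```python
import math

# Hypothetical illustrative values (NOT the data from the diabetic-patient example)
glucose = [8.0, 10.0, 12.0, 14.0, 16.0]  # fasting blood glucose, mmol/L
vcf = [1.1, 1.2, 1.2, 1.4, 1.6]          # Vcf, %/sec

n = len(glucose)
x_bar = sum(glucose) / n
y_bar = sum(vcf) / n

# Numerator: sum of cross-products of deviations from the means
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(glucose, vcf))
# Denominator terms: sums of squared deviations for each variable
sum_xx = sum((x - x_bar) ** 2 for x in glucose)
sum_yy = sum((y - y_bar) ** 2 for y in vcf)

r = sum_xy / math.sqrt(sum_xx * sum_yy)
print(f"r = {r:.3f}")
```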
21 Mathematical properties of correlation coefficient (r)
[Figure: scatter plot panels illustrating r = 0, r = -1, and r = +1]
24 Measuring Strength and Direction of Association…
The square of the correlation coefficient (r²) tells us how much variation is shared by the
two variables
The quantity r² varies between 0 and 1, since r varies between -1 and 1
In a Venn diagram of variables A and B, the overlapping (shaded) area represents the shared variance (r²),
and the non-overlapping (clear) areas represent the unique variance of each variable
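As a quick numerical illustration (the value r = 0.5 is chosen arbitrarily and does not come from the slides):

$$ r = 0.5 \;\Rightarrow\; r^2 = 0.5^2 = 0.25 $$

Under this assumption, 25% of the variance is shared by the two variables and the remaining 75% is unique to each.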
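25 Measuring Strength and Direction of Association…
To test the null hypothesis ρ = 0, the test statistic is
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$
It has the t distribution with n-2 degrees of freedom when the null hypothesis is
true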
27 Test of hypothesis and Ch/x for ρ
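If the distribution of the sample correlation coefficient r is not normal, we use
Fisher’s z transformation to determine the confidence interval
$$ z = 0.5\ln\left(\frac{1+r}{1-r}\right) $$
is approximately normally distributed with standard error $1/\sqrt{n-3}$
Make a 95% CI for z: [z1; z2]
$$ z_1 = 0.5\ln\left(\frac{1+r}{1-r}\right) - \frac{1.96}{\sqrt{n-3}}, \qquad z_2 = 0.5\ln\left(\frac{1+r}{1-r}\right) + \frac{1.96}{\sqrt{n-3}} $$
And use the back transformation to determine the CI for ρ:
$$ \left[\frac{e^{2z_1}-1}{e^{2z_1}+1},\ \frac{e^{2z_2}-1}{e^{2z_2}+1}\right] $$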
28 Test of hypothesis and Ch/x for ρ
Example: If the sample correlation coefficient between systolic blood pressure and age
is 0.66, the sample size n = 30 and the ρ = 0.85, calculate the statistic (T) and 95%
confidence interval for ρ, then decide whether it is significantly correlated or not
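$$ T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.66\sqrt{28}}{\sqrt{1-0.66^2}} \approx 4.65 $$
$$ z = 0.5\ln\left(\frac{1+0.66}{1-0.66}\right) = 0.793, \qquad \frac{1.96}{\sqrt{30-3}} = 0.377 $$
95% CI for Z = 0.793 ± 0.377 = (0.416, 1.170)
$$ \left[\frac{e^{2 \times 0.416}-1}{e^{2 \times 0.416}+1},\ \frac{e^{2 \times 1.170}-1}{e^{2 \times 1.170}+1}\right] $$
95% CI for ρ = (0.394, 0.824)
The interval does not include 0, so the correlation is statistically significant at the 5% level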
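The hand calculation in this example can be checked with a short Python sketch. It uses only r = 0.66 and n = 30 from the example; the variable and function names are illustrative, and 1.96 is the usual two-sided 95% critical value of the standard normal distribution. The results should land close to the hand-calculated values, with small differences due to rounding.

```python
import math

r, n = 0.66, 30   # sample correlation and sample size from the example
z_crit = 1.96     # two-sided 95% critical value of the standard normal

# Test statistic for H0: rho = 0, referred to a t distribution with n - 2 df
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Fisher's z transformation and its standard error
z = 0.5 * math.log((1 + r) / (1 - r))
se = 1 / math.sqrt(n - 3)
z1, z2 = z - z_crit * se, z + z_crit * se

def back_transform(zv):
    """Map a Fisher z value back to the correlation scale."""
    return (math.exp(2 * zv) - 1) / (math.exp(2 * zv) + 1)

ci_low, ci_high = back_transform(z1), back_transform(z2)

print(f"t = {t_stat:.2f} with {n - 2} degrees of freedom")
print(f"z = {z:.3f}, 95% CI for z = ({z1:.3f}, {z2:.3f})")
print(f"95% CI for rho = ({ci_low:.3f}, {ci_high:.3f})")
```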
29 Steps of Pearson correlation analysis
H0: The test scores on the screening test are not correlated with
the final course grade (ρ = 0)
H1: The test scores on the screening test are correlated with the
final course grade (ρ ≠ 0)
33 2. Define the significance level
3. Explain the difference between testing the significance of R² and the significance
of a regression coefficient (β)
4. Discuss methods for selecting variables for entry into a linear regression model
44 Objective of regression analysis
6. Know the statistical assumptions for a straight-line model and describe
how to test regression assumptions
Applications of regression
1. To characterize the relationship between dependent and independent
variables by determining the extent, direction and strength of association
2. To determine an equation that describes the dependent variable (Y) as a
function of the independent variable(s)
3. To describe the relationship between dependent and independent
variables while controlling the effect of other variables (confounders)
4. To determine which of the independent variables are important and
which are not for describing or predicting a dependent variable
49 Introduction…
5. To describe the best mathematical model for describing the relationship between
a dependent variable and one or more independent variables
7. To assess the interactive effects of two or more independent variables with regard
to a dependent variable (e.g., to determine whether the relationship of alcohol consumption to
blood pressure level differs depending on smoking habits)
Linear regression models use the correlation between variables and the notion of a
straight line to develop a prediction equation
In simple linear regression (one dependent variable and one independent
variable), the relationship can be graphed as a line
In multiple linear regression (one dependent variable and multiple independent
variables), the shape is not really a line
If there are three variables, the shape is a plane, and if there are four or more
variables, it is impossible to visualize or graph
However, by convention, we still refer to the regression equation as a regression
'line'
53 Linear Regression model assumption
1. Existence: for any fixed value of the variable X, Y is a random variable with
a certain probability distribution having finite mean and variance
The population mean of this distribution will be denoted $\mu_{Y|X}$ and the population
variance $\sigma^2_{Y|X}$
The notation Y|X indicates that the mean and variance of the random variable Y
depend on the value of X
54 Linear Regression model assumption…
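2. Independence: The Y-values are statistically independent of one another (the value
for one participant must not be influenced by the values of other participants)
3. Linearity: The mean value of Y, $\mu_{Y|X}$, is a straight-line function of X
Y = a + bX or $\mu_{Y|X} = \beta_0 + \beta_1 X$
Where $\beta_0$ and $\beta_1$ are the intercept and slope of the straight line respectively
The above equation can be expressed as:
$$ Y = \beta_0 + \beta_1 X + \varepsilon $$
where ε is the random error, the difference between an observed Y and its mean $\mu_{Y|X}$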
55 Linear Regression model assumption…
The simple (one independent variable) regression equation is the equation for
a straight line and is written as
Y′ = a + bX
Y′ is the predicted score for the dependent variable
The letter a in the model is called the intercept constant, also referred to as alpha,
and is the predicted value of Y when X = 0
The letter b represents the regression coefficient, also referred to as beta, and is
the change in the predicted Y for a one-unit change in X. It is the slope of
the regression line
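A minimal Python sketch of how a and b might be obtained and used for prediction is given below. The slides do not state how the coefficients are estimated at this point, so the sketch assumes ordinary least squares, with b equal to the sum of cross-products of deviations divided by the sum of squared deviations of X, and a equal to the mean of Y minus b times the mean of X. The data and the function name `fit_line` are hypothetical.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line Y' = a + bX (assumed method)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope: change in predicted Y per unit change in X
    a = y_bar - b * x_bar    # intercept: predicted Y when X = 0
    return a, b

# Hypothetical data for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = fit_line(x, y)
y_pred = a + b * 6           # predicted score Y' for a new value X = 6
print(f"Y' = {a:.2f} + {b:.2f}X; prediction at X = 6: {y_pred:.2f}")
```

The fitted line always passes through the point formed by the two means, which is why the intercept can be recovered from the slope and the means.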
59 Simple Linear Regression…
A correlation between two variables is used to develop a prediction equation,
with these predictions being based on a linear relationship between variables
If the relationship is curvilinear, other techniques, such as trend analysis, must
be used
If the correlation between two variables is perfect (+1 or −1), we could make a
perfect prediction about the score on one variable, given the score on the other
variable
In practice, correlations are almost never perfect, so predictions are almost never
perfect either
60 Simple Linear Regression…