
1 Chapter 2: Correlation and Regression Analysis

By
Abebe Feyissa
(PhD Candidate, Epidemiology & Biostatistics)
2 Content
 Objectives of this chapter
 Part I: General introduction to correlation and
regression analysis
 Part II: Correlation analysis
 Part III: Regression analysis
3 Objectives of this Chapter
 After studying this topic, the student will be able to:
1. Explain when to use correlation and regression techniques to answer research questions or
test hypotheses
2. Understand the assumptions behind correlation and regression analysis
3. Understand the principles and procedures of Pearson correlation and regression coefficients
4. Determine whether Pearson correlation and regression coefficients are statistically
significant
5. Use SPSS to compute the Pearson correlation and regression coefficients and correctly
interpret the outputs
6. Report correlation and regression coefficients in terms of the direction and strength of
association and their statistical significance
4

Part I: Introduction
5 Introduction
 Variable classification
 Variable classification helps in selecting the model of data analysis to be used
 There are three ways of classifying variables:
1. Gaps between values (continuous vs discrete)
2. Descriptive (predictive) role: dependent vs independent
3. Level of measurement (nominal, ordinal, interval, ratio)
6 Introduction…
 Choice of analysis model
 A strong rationale is required for choosing a particular method of data analysis
 Four considerations guide the choice of a particular model for data analysis:
 Aim of the study
 Mathematical characteristics of the variables involved in the analysis
(relationship, level of measurement)
 Statistical assumptions made about the variables included
 Method of data collection (sampling procedure, response-recording methods)
7 Introduction…
 Correlation and regression analysis are statistical methods that tell us:
 The relationship between variables
 The nature and strength of the relationship
 How to predict a given value from a pre-known value based on a
prediction model (regression)
 Correlation tells how well the estimating equation actually describes
the relationship
8 Introduction…
 Simple linear regression models bivariate relationships
 It examines the relation between two random variables
 The known variable is the independent/explanatory/predictor
variable
 The variable to be predicted is the dependent/outcome variable
 Multiple regression describes the process by which several variables
are used to predict one variable
 We shall develop an estimating equation that relates the independent
variables to the dependent variable
9

Part II: Correlation


10 Objectives of correlation analysis

 After attending this class, the student will be able to:


 Explain when to use correlational techniques to answer research questions or test
hypotheses
 Choose between the Pearson and Spearman’s correlation coefficients
 Hand-compute the Pearson correlation coefficient and determine whether it is
statistically significant
 Use SPSS to compute the Pearson correlation coefficients and correctly interpret the
output
 Report a correlation coefficient in terms of the direction and strength of association and
its statistical significance
11 Introduction to Correlation Analysis
 Correlation measures a linear association between two continuous variables
 Its techniques are used to study the relationship between two different variables, or
the same variable measured at two different points in time
 There is no distinction between variable types (dependent vs independent)
 A correlation coefficient (r) provides a measure of the strength and direction of
an association between two variables
 A p-value is calculated to provide an assessment of the statistical significance
of that association
12 Introduction to Correlation Analysis…

 There are two types of correlation coefficients. Both of them measure association
 The Pearson correlation coefficient is used when both variables are normally
distributed
 It is a parametric test that measures the association of two variables at the
interval or ratio scale
 The Spearman correlation coefficient is used when one or both of the variables
are not normally distributed
 It is a nonparametric test that measures the association of two variables at the
ordinal, interval, or ratio scale
13 Introduction to Correlation Analysis…

 Hypothesis stating
 As with most statistical tests, we first have to state the null and alternative
hypotheses
 When using correlation coefficients, the two-sided hypotheses could be:
 H0: There is no correlation between the two variables (ρ = 0)
 H1: There is a correlation between the two variables (ρ ≠ 0)
14 Introduction to Correlation Analysis…

 Example
 Within a population of diabetic patients, two variables have been measured:
 Fasting blood glucose (mmol/l)
 Mean circumferential shortening velocity of the left ventricle (Vcf)
(%/sec)
 We want to study the relation between these two variables
15 Introduction to Correlation Analysis…
 Example...
 What is the relationship between fasting blood glucose and mean circumferential
shortening velocity of the left ventricle?
 Graphically (scatter plot)
 Numerically
 Correlation (correlation coefficient)
 Linear regression (regression coefficient)
 How do you measure the strength of the relationship?
 Correlation coefficient or regression coefficient
16 Assumptions of Pearson correlation

1. The two variables are on either the interval or ratio measurement scales
2. The two variables are normally distributed
3. The two variables are related to each other in a linear fashion (straight
line)
4. There are no outliers (observations that fall outside the pattern of the rest of
the data)
17 When can a Pearson correlation coefficient be used?
 The study participants constitute an independent random sample
(sampling should be random)
 There are two variables to be compared
 The two measures are normally distributed
 The two measures are on an interval or ratio measurement scale
 The two variables have a linear relationship

18 When can a Pearson correlation coefficient be used?...
 There are no influential outliers
 For each value of one variable, the distribution of the other variable is normal
 For every value of the first measure (X), the distribution of the second
measure (Y) must have equal variance, and for every value of Y, the
distribution of X must have equal variance (this is called the assumption of
homoscedasticity)
19 Scatter plot
 Relationship between Vcf and blood
sugar level in diabetic patients, shown
graphically
[Scatter plot: glucose (x-axis, 6.0–18.0 mmol/l) vs Vcf (y-axis, 1.20–1.80 %/sec)]
20 Mathematical presentation of correlation
 We can measure covariance, but covariance depends on the units used for
both variables:
cov(X, Y) = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1)
 Correlation is unitless:
r = Σ(Xi − X̄)(Yi − Ȳ) / √[Σ(Xi − X̄)² · Σ(Yi − Ȳ)²]
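The covariance and correlation formulas above can be checked with a short Python sketch (Python is used here only for illustration; the chapter's own software is SPSS). The paired glucose/Vcf-style values below are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: the sum of cross-products of deviations
    divided by the square root of the product of the sums of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical glucose (x) and Vcf (y) measurements
x = [6.0, 8.0, 9.5, 12.0, 15.0, 18.0]
y = [1.20, 1.32, 1.35, 1.50, 1.64, 1.80]
print(round(pearson_r(x, y), 3))
```

Because the (n − 1) denominators of the covariance and the two standard deviations cancel, r comes out the same whether or not they are included, which is why the sum-of-squares form above omits them.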
21 Mathematical properties of correlation
Coefficient (r)
 Correlation coefficients are denoted by r
1. The possible values of r range from −1 to 1
2. r is a dimensionless quantity; that is, r is independent of the units of
measurement of X and Y
3. r is positive, negative, or zero exactly when the regression slope ß1 is positive,
negative, or zero, and vice versa
22 Measuring Strength and Direction of
Association
 r = +1 means there is a perfect positive relationship. As one variable
increases, the other variable also increases
 r = −1 means there is a perfect negative or inverse relationship between
the two variables. As one variable increases in value, the other decreases
 r = 0 means there is no relationship (no association) between the two
variables. An increase in one variable is not associated with a predictable
change in the other variable
 The closer the correlation coefficient is to zero, the weaker the
association between the two variables
23 Measuring Strength and Direction of Association…
 Direction and strength of association
[Scatter plots illustrating perfect positive (r = +1), no (r = 0), and perfect
negative (r = −1) association]
24 Measuring Strength and Direction of
Association…
 The square of the correlation coefficient (r2), tells us how much variation is shared by the
two variables
 The quantity r² varies between 0 and 1, since r varies between −1 and 1
 As illustrated by the Venn diagram below, the shaded area shows the shared variance (r2)
between variables A and B, and the clear area shows the unique variance of each variable
25 Measuring Strength and Direction of
Association…

 How large should r2 be in practice?

 There are no strict statistical guidelines for deciding whether or not


a particular r2 value is large enough, since such a decision will often
depend on the research question under study
26 Test of hypothesis and characteristics of ρ
 ρ is the population correlation coefficient, estimated by r (the sample correlation
coefficient)
 Hypotheses
 H0: ρ = 0
 HA: ρ ≠ 0
 The test statistic is given by
T = r√(n − 2) / √(1 − r²)
 It has the t distribution with n − 2 degrees of freedom when the null hypothesis is
true
27 Test of hypothesis and characteristics of ρ
 If the distribution of the sample correlation coefficient r is not normal, we use
Fisher's z transformation to determine the confidence interval
 z = 0.5 ln[(1 + r)/(1 − r)] is normally distributed with standard error 1/√(n − 3)
 A 95% CI for z, [z1, z2], is given by z ± 1.96/√(n − 3)
 Then use the back-transformation to determine the CI for r:
[(e^(2z1) − 1)/(e^(2z1) + 1), (e^(2z2) − 1)/(e^(2z2) + 1)]
28 Test of hypothesis and characteristics of ρ
 Example: If the sample correlation coefficient between systolic blood pressure and age
is r = 0.66 and the sample size is n = 30, calculate the test statistic (T) and the 95%
confidence interval for ρ, then decide whether the correlation is significant or not
 T = 0.66√(30 − 2) / √(1 − 0.66²) ≈ 4.65
 z = 0.5 ln[(1 + 0.66)/(1 − 0.66)] ± 1.96/√(30 − 3)
 95% CI for z = 0.793 ± 0.377 = (0.416, 1.17)
 Back-transforming: [(e^(2×0.416) − 1)/(e^(2×0.416) + 1), (e^(2×1.17) − 1)/(e^(2×1.17) + 1)]
 95% CI for ρ = (0.394, 0.824)
 Since the CI excludes 0 (and T exceeds the critical t value with 28 degrees of
freedom), the correlation is statistically significant
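As an illustrative cross-check of this worked example (the chapter itself works by hand and with SPSS), the following Python sketch recomputes T and the Fisher-z confidence interval for r = 0.66, n = 30; small differences in the last decimal come from rounding the intermediate values.

```python
import math

# Values from the worked example
r, n = 0.66, 30

# Test statistic: T = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2
T = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Fisher z transformation and 95% CI on the z scale
z = 0.5 * math.log((1 + r) / (1 - r))
half_width = 1.96 / math.sqrt(n - 3)
z_lo, z_hi = z - half_width, z + half_width

# Back-transform to the r scale: (e^{2z} - 1) / (e^{2z} + 1), i.e. tanh(z)
r_lo, r_hi = math.tanh(z_lo), math.tanh(z_hi)
print(round(T, 2), round(r_lo, 3), round(r_hi, 3))
```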
29 Steps of Pearson correlation analysis

1. State the hypothesis


2. Define the significance level
3. Make sure the data meet all the necessary assumptions
4. Present a scatter plot of the two variables to assess the relation to
each other in a linear fashion
5. Obtain the correlation coefficient between the two variables
6. Determine the statistical significance and state a conclusion
30 Illustrating the steps of Pearson correlation
analysis
 Let's say we are interested in predicting who will do well in a biostatistics
exam. So, we give everybody a screening test at the start of the class
(scored from 0 to 100) and then we get everyone's final numerical grade
(scored from 0 to 10). How could we tell whether the screening test had any
relation to the final exam?
31 Illustrating the steps of Pearson correlation
analysis…
 We assess the value of the screening test with the Pearson correlation coefficient,
asking the following question:
 Is the screening exam score positively associated with the final statistics grade of 10
students registered for the MPH at Salale University? That is, do students who
score well on the screening exam also do well on the final exam?
 To answer this question, we will use data from 10 graduate students. Each student
took the screening test before starting the biostatistics class
 We have measured:
 The dependent variable: statistics grade (ratio scale)
 The independent variable: screening test score (ratio scale)
32 1. Hypothesis stating

 H0: The scores on the screening test are not correlated with
the final course grade (ρ = 0)

 H1: The scores on the screening test are correlated with the
final course grade (ρ ≠ 0)
33 2. Define the Significance level

 Define the level of significance: We choose a two-sided test with


an α-level of 0.05
 Calculate the degrees of freedom
 df = (n − 2) = (10−2) = 8
 2 is the number of variables included in the analysis and n is
sample size
34 Define the Significance level…
 Find the critical value in the Pearson correlation
table for a two-tailed test with 8 degrees of
freedom and alpha = 0.05: rcrit = 0.632
 Suppose the calculated r were 0.893
 If the calculated r is greater than rcrit, reject the
null hypothesis at alpha = 0.05
 0.893 is greater than 0.632
 Thus, we would reject H0
35 3. Check That the Data Meet All the Assumptions

 There are two variables to be compared

 The two variables are normally distributed

 The two variables are ratio measurement scale

 The two variables have a linear relationship (see the scatterplot)


36 4. Create a scatter plot

 From this scatterplot, although not


perfect, we can see that there is a
positive correlation between the two
variables
37 5. Obtain the Correlation Coefficient

 The r (correlation coefficient) is 0.617. The r2 = 0.38


38 6. Determine Statistical Significance and
State a Conclusion
 The correlation between the screening test and the final statistics grade is 0.617
(r2 =0.38)
 Since the p-value is greater than 0.05 (p = 0.057), this correlation is not
statistically significant
 Conclusion: The screening exam is not significantly associated with the final
biostatistics grade; students who score high marks on the screening exam do
not necessarily do well on the final exam
39 SPSS procedure for Pearson Correlation

 Steps to conduct correlation analysis using SPSS:


 Analyze > Correlate > Bivariate > move the two variables into the
Variables box > OK
40 Summary

 Correlation is a procedure for quantifying the relationship between two or more


variables
 Correlation measures the strength and direction of the relationship
 Pearson correlation is used when looking at the linear relationship between two
normally distributed variables of interval or ratio scale
 Spearman correlation can be used when the relationship is not linear (but is
monotonic) and when one or both variables are not normally distributed or are
of ordinal scale
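To make the Pearson/Spearman contrast in the summary concrete, here is an illustrative Python sketch (not part of the chapter's SPSS workflow) that computes Spearman's coefficient the way it is defined: as the Pearson correlation of the ranks. The `ranks` and `spearman_rho` helpers are written for this example.

```python
import math

def ranks(values):
    """Rank values from 1..n, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation applied to the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

# A monotonic but non-linear relationship (y = x**3):
# Spearman detects the perfect monotonic association
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(spearman_rho(x, y))
```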
41 Question
 The Pearson correlation coefficient is best used to determine the
association of
a. two ratio variables to each other.
b. three or more ratio variables to each other.
c. two nominal variables to each other.
d. three or more ordinal variables to each other.
 The Spearman correlation coefficient should be used instead of the
Pearson correlation coefficient when
a. neither of the variables is normally distributed.
b. one of the variables is normally distributed.
c. both of the variables are normally distributed.
d. both a and b.
42

Part III: Regression analysis


43 Objectives of regression analysis

 After studying this topic, the student will be able to:

1. Know when it is appropriate to use linear regression

2. Know the concept of regression model with single independent variable/simple


regression model

3. Explain the difference between testing the significance of R2 and the significance
of a regression coefficient (ß)

4. Discuss methods for selecting variables for entry into a linear regression model
44 Objectives of regression analysis

5. Know the mathematical properties of a straight line

6. Know the statistical assumptions of the straight-line model and describe
testing of regression assumptions

7. Set up and solve a prediction equation

8. Know how the best-fitted straight line is determined

9. Know how the tests of slope and intercept are interpreted

45 Introduction

 Correlation tells you if there is an association between X and Y but it doesn’t


describe the relationship or allow you to predict one variable from the other
 To do this, REGRESSION ANALYSIS is required!

 Regression: technique concerned with predicting some variables by knowing


others
 The process of predicting variable Y (dependent or outcome variable) using
variable X (independent or explanatory or predictor variable)
46 Introduction…
 Correlation describes the strength and direction of a linear relationship between
two variables
 Linear means “straight line”
 Regression tells us how to draw the straight line described by the correlation
 It calculates the “best-fit” line for a certain set of data
 The regression line makes the sum of the squared residuals smaller
than for any other line
 Regression minimizes the residuals/errors
47 Introduction…

 Types of data required


 The primary data requirement for linear regression is that the dependent variable
should be on a ratio scale and normally distributed
 If the dependent variable is not normally distributed, it can be transformed
mathematically to make it more normal
 A common transformation is taking the natural log (base e) of the dependent
variable
 The independent variables can be of any scale
 Nominal independent variables that have more than two categories have to be put in
the model as dummy variables
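As a minimal sketch of the dummy-variable idea (illustrative Python; the `dummy_code` helper and the `smoking` variable are hypothetical, not from the chapter): a nominal variable with k categories enters the model as k − 1 indicator (0/1) columns, with one category left out as the baseline.

```python
def dummy_code(values, reference):
    """One 0/1 dummy column per non-reference category.

    A k-category nominal variable becomes k - 1 dummy variables;
    `reference` is the omitted baseline category.
    """
    categories = sorted(set(values))
    categories.remove(reference)
    return {f"is_{c}": [1 if v == c else 0 for v in values]
            for c in categories}

# Hypothetical 3-category smoking variable, with "never" as the baseline
smoking = ["never", "former", "current", "never", "current"]
dummies = dummy_code(smoking, reference="never")
print(dummies)
```

Each regression coefficient on a dummy column is then interpreted relative to the omitted baseline category.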
48 Introduction…

 Applications of regression
1. To characterize the relationship between dependent and independent
variables by determining the extent, direction and strength of association
2. To determine an equation that describes the dependent variable (Y) as a
function of the independent variable(s)
3. To describe the relationship between dependent and independent
variables while controlling for the effect of other variables (confounders)
4. To determine which of the independent variables are important and
which are not for describing or predicting a dependent variable
49 Introduction…

5. To identify the best mathematical model for describing the relationship between
a dependent variable and one or more independent variables

6. To compare several derived regression relationships

7. To assess the interactive effects of two or more independent variables with regard
to a dependent variable (e.g., whether the relationship of alcohol consumption to
blood pressure differs depending on smoking habits)

8. To obtain a valid and precise estimate of one or more regression coefficients


50 Introduction…
 Best-fit line
 The aim of linear regression is to fit a straight line, ŷ = α + ßx (α = intercept,
ß = slope), to the data that gives the best prediction of y for any value of x
 This will be the line that minimises the distance between the data and the
fitted line, i.e. the residuals
 ŷ = predicted value; yi = observed value; ε = yi − ŷ = residual error
51 Introduction…
 Simple versus multivariate regression
 A simple linear regression model, sometimes also referred to as univariate or
bivariate, models the association of one dependent variable with one independent
variable
 Simple regression models show us the crude, or unadjusted, association between
the two variables
 A multiple linear regression (MLR) model shows the relationship between the
dependent variable and multiple independent variables
 The overall variance explained by the model as well as the unique contribution
(strength and direction) of each independent variable can be obtained
52 Introduction…

 Linear regression models use the correlation between variables and the notion of a
straight line to develop a prediction equation
 In simple linear regression (one dependent variable and one independent
variable), the relationship can be graphed as a line
 In multiple linear regression (one dependent variable and multiple independent
variables), the shape is not really a line
 If there are three variables, the shape is a plane, and if there are four or more
variables, it is impossible to visualize or graph
 However, by convention, we still refer to the regression equation as a regression
'line'
53 Linear Regression model assumptions
1. Existence: for any fixed value of the variable X, Y is a random variable with a
certain probability distribution having finite mean and variance
 The population mean of this distribution is denoted μY|X and the population
variance σ²Y|X
 The notation Y|X indicates that the mean and variance of the random variable Y
depend on the value of X
54 Linear Regression model assumptions…
2. Independence: The Y-values are statistically independent of one another (the value
for one participant must not be influenced by another participant's)
3. Linearity: The mean value of Y, μY|X, is a straight-line function of X:
μY|X = α + ßX
 where α and ß are the intercept and slope of the straight line, respectively
 The above equation can be expressed as:
Y = α + ßX + ε
55 Linear Regression model assumptions…
 The error component ε is the distance between the observed
Y and the population (or estimated) Y:
 Y = α + ßX + ε
 ε = Y − (α + ßX)
 ε = Y − μY|X
56 Linear Regression model assumptions…
4. Homoscedasticity: The variance of Y is the same for any X (homo-
means same and scedastic means scattered)
57 Linear Regression model assumptions…
5. Normal distribution: For a fixed value of X, Y has a normal distribution
 This assumption is used to evaluate statistical significance via confidence
intervals and tests of hypothesis of the relation between X and Y
58 Simple Linear Regression

 The simple (one independent variable) regression equation is the equation for
a straight line and is written as
Y′ = a + bX
 Y′ is the predicted score for the dependent variable
 Letter a in the model is called the intercept constant, also referred to as alpha,
and is the value of Y when X = 0
 Letter b represents the regression coefficient, also referred to as beta, and is
the rate of change in Y with a unit change in X. It is a measure of the slope of
the regression line
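The intercept a and slope b in Y′ = a + bX can be estimated from data with the standard least-squares formulas, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. A minimal Python sketch (illustrative data; the chapter's own tool is SPSS):

```python
def fit_line(x, y):
    """Least-squares estimates for Y' = a + bX:
    b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2),
    a = mean_y - b * mean_x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

# Illustrative data lying exactly on the line y = 1 + 2x
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
a, b = fit_line(x, y)
print(a, b)            # intercept and slope
predicted = a + b * 5  # predict Y' for a new X = 5
```

With real (noisy) data the fitted line will not pass through every point; it is simply the line that minimizes the sum of squared residuals, as described above.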
59 Simple Linear Regression…
 A correlation between two variables is used to develop a prediction equation,
with these predictions being based on a linear relationship between variables
 If the relationship is curvilinear, other techniques, such as trend analysis, must
be used
 If the correlation between two variables is perfect (+1 or −1), we could make a
perfect prediction about the score on one variable, given the score on the other
variable
 Of course, we never get perfect correlations, so we are never able to make
perfect predictions
60 Simple Linear Regression…

 The greater the correlation, the more accurate the prediction


 If there is no correlation between two variables, knowing the score of one
would be absolutely no help in estimating the score of the other
 When we have no information to aid us in predicting a score, our best guess for
any subject would be the mean (which is also the median for a normally
distributed variable), because that is the center of the data
 To be able to make predictions, the relationship between two variables, the
independent (X) and the dependent (Y), must be measured
 If there is a correlation, a regression equation can be developed that will allow
prediction of Y given X