
Statistical Methods for Continuous Variables
Correlation and Regression

1
Contents of the chapter
1. Simple linear correlation

2. Simple linear regression

3. Multiple linear regression

2
1. Correlation Analysis
• Correlation is the method of analysis to use when studying the possible association between two quantitative variables.

• It measures the degree of linear relationship between two variables.

• The population correlation coefficient is represented by: ρ

• The sample correlation coefficient is represented by: r

• r takes any value from –1 to +1; r is a dimensionless quantity.

3
Correlation…
• r is positive if higher values of one variable are associated with higher values of the other variable, and

• r is negative if one variable tends to be lower as the other gets higher.

• A correlation of around zero indicates that there is no linear relationship between the values of the two variables.

• r is a measure of the scatter of the points around an underlying linear trend: the greater the spread of the points, the lower the correlation.
4
Description of Relationships
• Bivariate relationships can be described in terms of their:
Strength
– All points close to the line (strong, near 1) or not close to the line (weak, near 0)
Magnitude
– A higher magnitude is shown by a steeper line (the absolute value of the slope reflects the magnitude of the relationship)
Direction
– Positive: Y increases as X increases
– Negative: Y decreases as X increases

5
Scatter Diagram
• A two-dimensional scatter plot is the fundamental graphical tool for looking at correlation and regression data.

• The scatter plot of the response versus the predictor is the starting point for any correlation or regression analysis.

6
This line shows a perfect linear relationship between two variables.
It is a perfect positive correlation (r = 1).

7
A weak positive correlation (r might be around .40)


8
No linear association between variables (r ~ 0)

9
10
11
12
• One of the reasons for producing scatter plots of the data as part of the initial analysis is to identify nonlinear relationships when they occur. If the correlation coefficient is calculated without examining the data, one can miss a strong, but nonlinear, relationship.

13
Nonlinear relationships
[Figure: two example scatter plots showing nonlinear relationships between X (0–50) and Y]

14
• Growth patterns frequently follow a sigmoid curve (age vs growth)

[Figure: sigmoid growth curve of size against age, 0–50]

• Growth at the start is slow
• It then speeds up
• It slows down again as it reaches its limiting size
15
Pearson Correlation Coefficient
– After Karl Pearson (1857 – 1936)
• r = the covariability of X and Y divided by the variability of X and Y separately:

  r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]

  where each sum runs over i = 1 to n.
Example

• Calculate the correlation coefficient


between total/HDL cholesterol (X) and
mean change in vessel diameter (Y) of
18 hypothetical patients.
Patient X Y
1 6.8 0.13
2 5.3 0
3 6.1 -0.18
4 4.3 -0.15
5 5.0 0.11
6 7.1 0.43
7 5.5 0.41
8 3.8 -0.12
9 4.6 0.06
10 6.0 0.06
11 7.2 -0.19
12 6.4 0.39
13 6.0 0.30
14 5.5 0.18
15 5.8 0.11
16 8.8 0.94
17 4.5 -0.07
18 5.9 -0.23
Σ(xᵢ − x̄)(yᵢ − ȳ) = 3.6668
Σ(xᵢ − x̄)² = 24.6378
Σ(yᵢ − ȳ)² = 1.4586
(each sum over i = 1 to 18)

r = 3.6668 / √(24.6378 × 1.4586) = 0.61
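For readers who want to check this arithmetic outside SPSS, here is a minimal Python sketch (not part of the original slides; it assumes only NumPy is installed) that reproduces the sums and r for the 18 hypothetical patients above.

import numpy as np

# Total/HDL cholesterol (X) and mean change in vessel diameter (Y)
# for the 18 hypothetical patients listed in the table above.
x = np.array([6.8, 5.3, 6.1, 4.3, 5.0, 7.1, 5.5, 3.8, 4.6,
              6.0, 7.2, 6.4, 6.0, 5.5, 5.8, 8.8, 4.5, 5.9])
y = np.array([0.13, 0.00, -0.18, -0.15, 0.11, 0.43, 0.41, -0.12, 0.06,
              0.06, -0.19, 0.39, 0.30, 0.18, 0.11, 0.94, -0.07, -0.23])

# Pearson r = sum of cross-products / sqrt(product of the sums of squares)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
r = sxy / np.sqrt(sxx * syy)
print(round(r, 2))   # about 0.61, matching the hand calculation above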
Strength of relationship

Absolute value of correlation (r)    Strength of relationship
r < 0.25                             No relationship
0.25 ≤ r < 0.5                       Weak relationship
0.5 ≤ r < 0.75                       Moderate relationship
r ≥ 0.75                             Strong relationship
Assumptions in correlation:

1. For each value of X, there is a normally distributed population of Y values.
2. For each value of Y, there is a normally distributed population of X values.
3. The joint distribution of the X and Y values is (bivariate) normal.
4. The respective populations of X and Y values have homogeneous variances.

21
Hypothesis Testing for the Pearson Correlation
Coefficient:
1. Calculate the relevant statistics from the sample (r).
2. State and confirm the assumptions.
3. State the hypotheses and the level of significance.
   Ho: ρ = 0
   HA: ρ ≠ 0
4. Calculate the value of the test statistic:

   t = r√(n − 2) / √(1 − r²)

5. Make decisions and conclusions as usual.
22
Significance Test for Pearson
Correlation
• H0: ρ = 0    Ha: ρ ≠ 0   (can also do a 1-sided test)
• Test statistic:

  t_obs = r / √[(1 − r²)/(n − 2)] = r√(n − 2) / √(1 − r²)

  with n − 2 degrees of freedom
• P-value: 2·P(t ≥ |t_obs|)
Example
• Recall the example discussed above
• r = 0.61 and n = 18, α = 0.05

  t = 0.61√(18 − 2) / √(1 − 0.61²) = 3.08

• t(0.025, 16) = 2.12
• Conclusion: reject the null hypothesis, i.e., the relationship is significant
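A quick Python check of this test statistic, the critical value, and the two-sided p-value (a sketch, not from the slides; it assumes SciPy is available for the t distribution):

import math
from scipy import stats

r, n, alpha = 0.61, 18, 0.05
t_obs = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # about 3.08
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)        # about 2.12
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)

print(round(t_obs, 2), round(t_crit, 2), round(p_value, 4))
# t_obs exceeds t_crit (and p < 0.05), so Ho: rho = 0 is rejected.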
Example 2
• A researcher believes that there is a linear relationship between the BMI (kg/m²) of pregnant mothers and the birth weight (BW, in kg) of their newborns.
• The following data set provides information on 15 pregnant mothers who were contacted for this study.
a) Test whether there is a relationship between the BMI of mothers and the BW of infants at α = 0.05
b) Display the association using a scatter diagram.
c) Present the result of the output using text presentation.

25
BMI (kg/m²)   Birth weight (kg)
20 2.7
30 2.9
50 3.4
45 3.0
10 2.2
30 3.1
40 3.3
25 2.3
50 3.5
20 2.5
10 1.5
55 3.8
60 3.7
50 3.1
35 2.8
26
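As a cross-check of the result quoted in the text presentation that follows, a minimal Python sketch (not part of the slides; it assumes SciPy) computing Pearson's r for these 15 mothers:

from scipy import stats

# BMI (kg/m^2) and birth weight (kg) for the 15 mothers listed above.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw  = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]

r, p = stats.pearsonr(bmi, bw)   # Pearson r and its two-sided p-value
print(f"r = {r:.3f}, n = {len(bmi)}, p = {p:.4g}")
# The slides report r = 0.907 with p < 0.001 for these data.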
Presenting correlation results
Example
The relationship between mothers' BMI and the birth weight of their infants was investigated using the Pearson correlation coefficient. Preliminary analyses were performed to ensure no violation of the assumptions of normality, linearity, and homogeneity of variances. There was a strong positive correlation between the two variables, r = 0.907, n = 15, p < 0.001.

27
Steps in using SPSS to run simple linear
correlation
• After importing your dataset and providing names to the variables, click on:
  – Analyze → Correlate → Bivariate
• Move both variables into the Variables box
• Click on Options

28
When not to calculate r:
• It may be misleading to calculate r when:
  – there is a non-linear relationship between the two variables
  – the data include more than one observation on each individual
  – there is an outlier in the data set
  – the data comprise sub-groups of individuals for which the mean levels of the observations on at least one of the variables differ.

29
Limitation of Correlations:
1) It quantifies only the strength of the linear relationship between two variables

2) It is very sensitive to outlying values, and thus can sometimes be misleading

3) It cannot be extrapolated beyond the observed ranges of the variables

4) A high correlation does not imply a cause-and-effect relationship
30
2. Regression

The family of regression methods includes:
  – Linear regression: simple LR and multiple LR
  – Logistic regression: simple (binary), ordinal and multinomial LR
  – Cox regression

31
Application of Regressions
• Response type → Model type
  – Continuous → Linear regression
  – Counts → Poisson regression
  – Survival times → Cox model
  – Binomial → Logistic regression
• Uses
  – Control for potentially confounding factors
  – Model building, risk prediction

32
2.1 Simple linear regression (SLR)
• A way of predicting an outcome variable from one predictor variable (simple regression).
• It models the straight-line relationship between a single dependent variable and a single explanatory variable.
• Focuses on how Y on average changes for a unit change of X, and on how one can predict Y based on X.
• Used when the goal is to predict the value of one characteristic from knowledge of another.
• Assumes a straight-line, or linear, relationship between the two variables.

33
• The linear relationship between Ŷ and X is described by the equation:

  Ŷ = a + bx

  Where: a = intercept
         b = slope of the line
         x = the explanatory variable
         Ŷ = the predicted outcome variable
34
Simple Linear Regression

35
The simple linear regression line is fitted by the method of least squares, which involves minimizing the sum of squared deviations between the observations and the regression line, i.e. minimizing the residuals.
36
Regression
Can we fit a straight line to this?
  y = a + bx

[Figure: scatter plot of SBP (y-axis) against Weight in kg (x-axis), with the fitted line
 y = 84.2 + 0.7x,  r = 0.68,  r² = 0.46]
37
• The equation Ŷ = a + bx is a point estimate of Y =
α + βX of the population.
– b= the sample regression coefficient of Ŷ on X.
– β= the population regression coefficient of Y on X.
– Y on X means Y is the Dependent variable (DV) and X is
the independent variable (IV).
• +ve slope: the value of the DV (Ŷ) increases as the IV (X) increases
• −ve slope: the value of the DV (Ŷ) decreases as the IV (X) increases

38
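To make the least-squares step concrete, here is a small Python sketch (not from the slides). The weight and SBP values below are hypothetical numbers invented purely for illustration; they are not the data behind the plot above.

import numpy as np

# Hypothetical weight (kg) and SBP values, invented only to show the mechanics.
weight = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65])
sbp    = np.array([98, 103, 104, 110, 112, 118, 117, 124, 126, 131])

# Least squares: b = Sxy / Sxx,  a = ybar - b * xbar
sxy = np.sum((weight - weight.mean()) * (sbp - sbp.mean()))
sxx = np.sum((weight - weight.mean()) ** 2)
b = sxy / sxx
a = sbp.mean() - b * weight.mean()

print(f"fitted line: SBP_hat = {a:.1f} + {b:.2f} * weight")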
Graphical display for data (SBP vs Age)

39
Assumptions in SLR
1) There is a linear relationship between the dependent (Y) and the independent (X) variable.
2) The observations in the sample are independent.
3) For each value of X, the values of Y in the population are normally distributed.
4) The X variable is measured without error (no distributional assumption is required for X).
5) The errors are normally distributed with zero mean and a constant standard deviation, also known as homoscedasticity (from the Greek for "equal scatter").

40
Steps in Hypotheses Testing using SLR
1) Describe the data and calculate the relevant values.
2) Check the assumptions
3) State the hypotheses:
Ho: β = 0 (i.e., X is not useful as a predictor of Y, or knowing X does not provide any information about Y, or there is no linear relationship between X and Y)

HA: β ≠ 0 (i.e., X is a predictor of Y, or X may provide some information about Y, or there is a linear relationship between X and Y)

41
4) Test statistic:
   – If σb is known, the test statistic is Z = (b − β) / σb
   – Otherwise use t = (b − β) / sb; for Ho: β = 0 this becomes t = b / sb

5) Make decision and conclusion accordingly

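A hand-computed version of this slope test in Python (a sketch with hypothetical x and y values, assuming NumPy and SciPy; sb here is the usual standard error of the slope, √(MSE/Sxx)):

import numpy as np
from scipy import stats

# Hypothetical illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.8, 4.2, 5.1, 5.8, 6.9, 7.2])

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / sxx   # slope
a = y.mean() - b * x.mean()                         # intercept
resid = y - (a + b * x)
mse = np.sum(resid ** 2) / (n - 2)                  # residual mean square
sb = np.sqrt(mse / sxx)                             # standard error of b

t = b / sb                                          # test of Ho: beta = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"b = {b:.3f}, s_b = {sb:.3f}, t = {t:.2f}, p = {p:.4g}")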
42
Steps in using SPSS to run simple linear
regression
• After importing your dataset and providing names to the variables, click on:
  – ANALYZE → REGRESSION → LINEAR
• Select the DEPENDENT VARIABLE
• Select the INDEPENDENT VARIABLE(S)
• Click on STATISTICS, then ESTIMATES, CONFIDENCE INTERVALS, MODEL FIT
• For a histogram of residuals, click on PLOTS, then HISTOGRAM under STANDARDIZED RESIDUAL PLOTS

43
Example 1.2.1: Using the data file 'data for exercise 1', assess the effect of age on BMI of the respondents at a confidence level of 95%. Use SPSS.

a) Calculate the regression coefficients (a & b)
b) Test Ho: β = 0 against Ha: β ≠ 0
c) Write the regression model
d) Estimate the BMI of a person who is 50 years old

44
Let's use SPSS to run the simple linear regression:
• Steps:
  ANALYZE → REGRESSION → LINEAR
  i. Select the dependent variable (BMI) and move it to the box named DEPENDENT
  ii. Select the independent variable (age) and move it to the box named INDEPENDENT
  iii. Click on STATISTICS, then choose: ESTIMATES, CONFIDENCE INTERVALS, MODEL FIT and DURBIN-WATSON under RESIDUALS => Continue
  iv. For checking the assumption of normality, click on PLOTS, move ZPRED to the X-axis and ZRESID to the Y-axis, and select NORMAL PROBABILITY PLOT under STANDARDIZED RESIDUAL PLOTS => Continue => OK
  v. Interpret the output on the next pages. (An equivalent analysis outside SPSS is sketched below.)
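For readers working outside SPSS, a rough Python/statsmodels sketch of the equivalent analysis follows. The file name 'data for exercise 1.csv' and the column names 'age' and 'bmi' are assumptions for illustration only, not the actual names used in the course data file.

import pandas as pd
import statsmodels.api as sm

# Assumed file and column names, for illustration only.
data = pd.read_csv("data for exercise 1.csv")
X = sm.add_constant(data["age"])          # adds the intercept term a
model = sm.OLS(data["bmi"], X).fit()

print(model.summary())                    # R-square, ANOVA F, coefficients, CIs
print(model.params)                       # a (const) and b (age)
print(model.predict([[1, 50]]))           # predicted BMI for a 50-year-old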
45
Interpretation of the SPSS outputs: The table below shows whether the model fits. By the rule of thumb used here, we conclude the model fits if R-square is > 0.1 (10%). In this example R² = 0.256, which is greater than 0.10. Hence we conclude that the model fits.

46
The table below provides the ANOVA output which, if significant, shows an overall association between the dependent and independent variables. The p-value (Sig.) is 0.000, usually reported as p < 0.001, implying that there is a strong association between age and BMI of the respondents.

47
We draw our conclusion from the 'Coefficients' table shown below. The p-value corresponding to the independent variable, age, is 0.000, which is reported as < 0.001. Hence, we reject Ho: β = 0. The result shows that age is a strong predictor of BMI (p < 0.001).
Ŷ = a + bX,  Ŷ = 17.004 + 0.218X

48
The minimum and maximum values of the standardized residuals need to be examined to identify the presence of outliers. If we fix the cut-off points at ±3, we can see that the minimum value (−2.637) and maximum value (2.902) fall within these cut-off values. Hence we conclude that there are no outliers and thus the assumption of normality is satisfied for this example.

49
The Normal P-P Plot provides information on whether the residuals are normally distributed, which is one of the assumptions required for simple linear regression. If the dots follow the diagonal straight line and lie close to it, the data are approximately normally distributed. In this example, we can conclude that the data are approximately normally distributed.

50
The scatter plot below shows the standardized residuals. All the values fall within ±3, suggesting the residuals are approximately normally distributed. Moreover, the scatter plot provides information about homogeneity of variance across the distribution. Hence, we conclude that there are no outliers and that the variability of the residuals is uniform across the X-axis; thus the assumptions of normality and homogeneity of variance are satisfied for this example.

51
1.3. Multiple Linear Regression (MLR)
– is an extension of the most fundamental model describing the linear relationship between two variables (SLR).
– used to measure and describe the function relating two or more predictor (independent) variables to a single response (dependent) variable.
– The general linear equation is:
  Yfit = a + b₁X₁ + b₂X₂ + ... + bₖXₖ   (sample)
  Y = α + β₁X₁ + β₂X₂ + ... + βₖXₖ    (population)
Example
– SBP versus age, weight, height, etc.
52
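A minimal Python/statsmodels sketch of fitting such a model (the file name and the column names sbp, age, weight, height are hypothetical, chosen only to mirror the example above):

import pandas as pd
import statsmodels.api as sm

# Hypothetical data file with columns sbp, age, weight, height.
df = pd.read_csv("blood_pressure.csv")

X = sm.add_constant(df[["age", "weight", "height"]])   # a + b1*X1 + b2*X2 + b3*X3
model = sm.OLS(df["sbp"], X).fit()
print(model.summary())   # coefficients, R-square, overall ANOVA F test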
Assumptions of MLR:
a) The relationship between the dependent and
independent variables is linear.
– We can examine this assumption for any variable, by plotting
(i.e., by using bivariate scatter plots) the residuals (the
difference between observed values of the dependent
variable and those predicted by the regression equation)
against that variable.

– Any curvature in the pattern indicates a non-linear relationship and suggests that a transformation of the explanatory variable may be considered.

53
b) We can produce a Normal plot of the residuals, to check
the overall fit and verify that the residuals have an
approximately normal distribution. The normal plot may
identify outliers for further investigation.

c) We can plot the residuals against the fitted values. No


pattern should be discernible. In particular, the variability
of the residuals should be constant across the range of the
fitted values.

d) The observations (explanatory variables) should be


independent.
54
Hypotheses Testing for MLR
1) Describe the data using an ANOVA table:

   Source of variation         SS    DF        MS                  VR
   Due to regression           SSR   k         MSR = SSR/k         MSR/MSE
   About regression (error)    SSE   n−k−1     MSE = SSE/(n−k−1)

2) Check assumptions
3) State the hypotheses:
   Ho: β1 = β2 = β3 = ... = βk = 0
   HA: not all βi = 0
4) Test statistic:
   F = MSR/MSE

55
5)Decision and Conclusion:
• Decision:
– Reject Ho if Fc> Ft at specified degrees of freedom.
• Conclusion:
– The dependent variable is linearly related to the
independent variables as a group in the population.

56
Major types of Multiple regression:
1) Stepwise Regression
• A technique for choosing predictor variables from a
large set.
• Applied to:
– multiple linear regression
– logistic regression and
– Cox regressions.
• There are two basic strategies of applying this
technique:
– forward and
– backward stepwise regression
57
Forward stepwise regression:
• Begins by examining the simple relation between each potential explanatory variable and the outcome variable.

• Steps in applying this method are:


i) Find the single variable that has the strongest association
with the dependent variable and enter it into the model
(i.e., the variable with the smallest p-value).

58
ii) Find the variable among those not in the model that, when added to the model so far obtained, explains the largest amount of the remaining variability.

iii) Repeat step (ii) until the addition of an extra variable is not statistically significant at some chosen level such as P = 0.05.

59
Backward stepwise regression:
• We believe all the variables are potentially important explanatory variables.
• Therefore, we fit the full model, including all of these variables, and then remove unimportant variables one at a time.
• We use the same criterion, say P < 0.05, to determine significance.
• At each step we remove the variable with the smallest contribution to the model (i.e., the largest P-value), as long as that P-value is greater than the chosen level.

What do you do when you have a lot of independent variables (say, 30 or more)?
(Hint: start with the classical bivariate analysis; a rough sketch of backward elimination is shown below.)
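A simplified Python/statsmodels sketch of backward elimination by p-value (for illustration only; the outcome and predictor names in the usage example are hypothetical, and real model building usually also weighs subject-matter knowledge):

import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df, outcome, predictors, p_remove=0.05):
    """Repeatedly drop the predictor with the largest p-value above p_remove."""
    selected = list(predictors)
    while selected:
        X = sm.add_constant(df[selected])
        fit = sm.OLS(df[outcome], X).fit()
        pvals = fit.pvalues.drop("const")       # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] > p_remove:
            selected.remove(worst)              # least useful variable goes out
        else:
            return fit, selected                # all remaining p-values <= cut-off
    return None, []

# Usage (hypothetical file and column names):
# df = pd.read_csv("study_data.csv")
# fit, kept = backward_eliminate(df, "sbp", ["age", "weight", "height", "bmi"])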
60
2) Hierarchical/sequential Multiple regression
– Predictors are selected based on past works
– The IV are entered into the equation in the order specified
by the researcher
– Sets of variables are entered in steps or blocks
– The variable for which the researcher wants to control is
entered into block 1 of the model
3) Forced Entry(Standard/Enter method)
– all predictors are forced into the model simultaneously
– relies on good theoretical reasons for including the chosen
predictors
– the experimenter makes no decision about the order in
which variables are entered

61
Multi-Collinearity:
• A statistical concept where several independent
variables in a model are correlated.
• When this problem occurs, it means that at least one of the predictor variables is largely redundant with the other predictors.
• Two variables are considered to be perfectly collinear if
their correlation coefficient is +/- 1.0.
• Multicollinearity among independent variables will result
in less reliable statistical inferences.
• There are many statistical indicators of this type of
redundancy
– Diagnostic checks such as VIF must be done to
ensure that this problem is avoided
62
Multicollinearity…
• A variance inflation factor (VIF) is a measure of the amount of multicollinearity in a regression analysis; tolerance is its reciprocal (1/VIF).
• Either a high VIF or a low tolerance is indicative of multicollinearity.
• Small VIF values (VIF < 3) indicate low correlation among variables under ideal conditions.
• A common default VIF cut-off value is 5; only variables with a VIF less than 5 are then included in the model.
• However, note that many sources say that a VIF of less than 10 is acceptable.

63
Multicollinearity…
• It is calculated by taking an independent variable and regressing it against every other predictor in the model; if that regression has coefficient of determination Rj², then VIF = 1 / (1 − Rj²).

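A short Python/statsmodels sketch of this calculation (the data file and column names are hypothetical; variance_inflation_factor implements the regression-based definition above):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; in practice these come from your own dataset.
df = pd.read_csv("study_data.csv")
X = sm.add_constant(df[["age", "weight", "height", "bmi"]])

# One VIF per predictor (skip the constant column).
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)   # flag predictors with VIF above the chosen cut-off (5 or 10)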
64
Selection of Independent variables (IVs)
How should the IVs be selected out of the many variables collected?
1) The IVs could be selected based on the researcher's expert knowledge and previous studies.
2) IVs could also be selected by first conducting bivariate analyses (one IV vs the dependent variable) and selecting those that give a p-value of 0.25 or less for the multivariable analysis. This is usually done if the sample size is small.
3) The rule of thumb for deciding the number of IVs depends on several factors.

65
Presenting results of MLR

Independent variables (IVs) | Beta values (B) with 95% CI | SE (optional) | P-value
66
Example 1.3.2
In a study of factors thought to be related to admission patterns
to a hospital, the hospital administrator obtained the following
data on 10 communities in the hospital’s catchment area.
From these data:

a) Obtain the regression equation


b) Test the Ho: β1=β2=0 at α=0.05
c) Calculate the 95%CI for population coefficients.

67
Community | Admissions (persons/1000) | Availability of other health services | Index of indigence
1 61.6 6.0 6.3
2 53.2 4.4 5.5
3 65.5 9.1 3.6
4 64.9 8.1 5.8
5 72.7 9.7 6.8
6 52.2 4.8 7.9
7 50.2 7.6 4.2
8 44.0 4.4 6.0
9 53.8 9.1 2.8
10 53.5 6.7 6.7
Total 571.6 69.9 55.6

68
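For a check outside SPSS, a Python/statsmodels sketch fitting this two-predictor model to the ten communities above (the variable names are chosen here for illustration):

import pandas as pd
import statsmodels.api as sm

# Data for the 10 communities, typed in from the table above.
df = pd.DataFrame({
    "admissions":   [61.6, 53.2, 65.5, 64.9, 72.7, 52.2, 50.2, 44.0, 53.8, 53.5],
    "availability": [6.0, 4.4, 9.1, 8.1, 9.7, 4.8, 7.6, 4.4, 9.1, 6.7],
    "indigence":    [6.3, 5.5, 3.6, 5.8, 6.8, 7.9, 4.2, 6.0, 2.8, 6.7],
})

X = sm.add_constant(df[["availability", "indigence"]])
model = sm.OLS(df["admissions"], X).fit()

print(model.params)                # a, b1, b2 for the regression equation
print(model.f_pvalue)              # overall test of Ho: beta1 = beta2 = 0
print(model.conf_int(alpha=0.05))  # 95% CIs for the population coefficients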
Solutions
• Steps:
  ANALYZE → REGRESSION → LINEAR
  i. Select the dependent variable (baseline FBS) and move it into the box named DEPENDENT
  ii. Select the independent variables (age, annual income and BMI, all of them quantitative) and move them to the box named INDEPENDENTS
  iii. Click on STATISTICS, then choose: ESTIMATES, CONFIDENCE INTERVALS, MODEL FIT, Collinearity diagnostics and DURBIN-WATSON under RESIDUALS => Continue
  iv. For checking the assumption of normality, click on PLOTS, move ZPRED to the X-axis and ZRESID to the Y-axis, and select NORMAL PROBABILITY PLOT under STANDARDIZED RESIDUAL PLOTS => Continue => OK
  v. Interpret the output on the next pages
69
Interpretation of the SPSS outputs: The table below shows whether the multiple linear regression model fits. We conclude that the model fits if R-square is > 0.1 (10%). In this example R² = 0.356, which is greater than 0.10. Hence, we conclude that the model fits and can thus predict the dependent variable.

70
The table below provides the ANOVA output which, if significant, shows an overall association between the single dependent variable and the set of independent variables. The p-value (Sig.) is 0.000, usually reported as p < 0.001, implying that there is a strong association between the dependent variable (baseline FBS) and the independent variables (age, BMI and annual income) of the respondents.

71
Based on the SPSS output in the following table, we can observe that the p-values corresponding to the independent variables age and annual income are 0.013 and 0.005, respectively. Hence, we reject Ho: β1 = β2 = β3 = 0. The result shows that age and annual income are independent predictors of FBS.

72
The minimum and maximum values of the standardized residuals are contained within the cut-off points of ±3: the minimum value is −1.962 and the maximum value is 2.171. Hence, we conclude that there are no outliers and thus the assumption of normality is satisfied for this multiple regression model.

73
The Normal P-P Plot provides information on whether the residuals are normally distributed, which is one of the assumptions required for multiple linear regression. If the dots follow the diagonal straight line and lie close to it, the data are approximately normally distributed. In this example, we can conclude that the data are approximately normally distributed.

74
From the scatter plot below, we can see that all the values fall within ±3, and thus the residuals are approximately normally distributed. Hence, we conclude that there are no outliers and that the variability of the residuals is uniform across the X-axis; the assumptions of normality and homogeneity of variance are therefore satisfied. The distribution of the residuals is random, showing that the observations are independent. We can also draw a straight line representing the association, implying that the relationship between the dependent and independent variables is linear.

75
