Bio2 Module 4 - Multiple Linear Regression

UNIVERSITY OF GONDAR

Department of Community Health

Multivariate Analysis

Getu Degu

November 2008

Multivariate Analysis
Multivariate analysis refers to the analysis of data that takes into account a number of explanatory variables and one outcome variable simultaneously.

It allows for the efficient estimation of measures of association while controlling for a number of confounding factors.

All types of multivariate analyses involve the construction of a mathematical model to describe the association between independent and dependent variables.

A large number of multivariate models have been developed for specialized purposes, each with a particular set of assumptions underlying its applicability.

The choice of the appropriate model is based on the underlying design of the study, the nature of the variables, as well as assumptions regarding the inter-relationship between the exposures and outcomes under investigation.

I) Multiple linear regression
Multiple linear regression (we often refer to this method simply as multiple regression) is an extension of the most fundamental model describing the linear relationship between two variables.

Multiple regression is a statistical technique that is used to measure and describe the function relating two (or more) predictor (independent) variables to a single response (dependent) variable.

Regression equation for a linear relationship

A linear relationship between n predictor variables, denoted X1, X2, . . . , Xn, and a single response variable, denoted Y, is described by the general linear equation:

Y = a + b1X1 + b2X2 + . . . + bnXn

Where:
a is the intercept (the predicted value of Y when every X is zero);
the regression coefficients b1 . . . bn represent the independent contributions of each explanatory variable to the prediction of the dependent variable;
X1 . . . Xn represent the individual's particular set of values for the independent variables; and
n is the number of independent (predictor) variables.
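As a minimal illustration (not part of the original handout), the sketch below estimates a and b1 . . . bn by ordinary least squares with NumPy. The five data rows are taken from the example table later in this handout, using mother's height (X2) and period of gestation (X6) as predictors of birth weight; everything else in the sketch is an assumption.

```python
import numpy as np

# Five observations taken from the example table later in this handout:
# columns are mother's height (cm) and gestation (days); y is birth weight (kg).
X = np.array([[170.0, 280.0],
              [156.0, 255.0],
              [166.0, 265.0],
              [168.0, 275.0],
              [162.0, 268.0]])
y = np.array([3.6, 2.5, 3.0, 3.2, 2.8])

# Prepend a column of ones so the first fitted coefficient is the intercept a.
X_design = np.column_stack([np.ones(len(y)), X])

# Ordinary least squares: minimises the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

a, b = coef[0], coef[1:]
print("intercept a:", a)
print("coefficients b1 ... bn:", b)

# Fitted values from Y = a + b1*X1 + ... + bn*Xn
y_hat = X_design @ coef
print("predicted Y:", y_hat)
```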

Assumptions
a) First of all, as is evident in the name multiple linear regression, it is assumed that the relationship between the dependent variable and each continuous explanatory variable is linear. We can examine this assumption for any variable by plotting the residuals (the differences between the observed values of the dependent variable and those predicted by the regression equation) against that variable, using bivariate scatter plots (see the sketch after this list). Any curvature in the pattern indicates that a non-linear relationship is more appropriate, and a transformation of the explanatory variable may be considered.

b) It is assumed in multiple regression that the residuals follow a normal distribution and have the same variability throughout the range.

c) The observations are independent of one another.
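A minimal sketch of such a residual plot, continuing the NumPy example above (y, X, X_design and coef are carried over) and assuming matplotlib is available; curvature or a funnel shape in the plot would point to non-linearity or unequal variability.

```python
import matplotlib.pyplot as plt

# Residuals: observed Y minus the values predicted by the fitted equation.
residuals = y - X_design @ coef

# Plot the residuals against one continuous explanatory variable (here mother's height).
plt.scatter(X[:, 0], residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Explanatory variable (height of mother)")
plt.ylabel("Residual")
plt.title("Residuals vs. explanatory variable")
plt.show()
```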

Predicted and Residual Scores

The regression line expresses the best prediction of the dependent variable (Y), given the independent variables (X).

However, nature is rarely (if ever) perfectly predictable, and usually there is substantial variation of the observed points around the fitted regression line.

The deviation of a particular point from the regression line (its predicted value) is called the residual value.

Residual Variance and R-square

The smaller the variability of the residual values around the regression line relative to the overall variability, the better our prediction.

For example, if there is no relationship between the X and Y variables, then the ratio of the residual variability of the Y variable to the original variance is equal to 1.0.

If X and Y are perfectly related, then there is no residual variance and the ratio of variance would be 0.0.

In most cases, the ratio falls somewhere between these extremes, that is, between 0.0 and 1.0.

One minus this ratio is referred to as R-square or the coefficient of determination.

This value is immediately interpretable in the following manner: if we have an R-square of 0.4, then we know that the variability of the Y values around the regression line is (1 - 0.4) = 0.6 times the original variance.

In other words, we have explained 40% of the original variability and are left with 60% residual variability.

Ideally, we would like to explain most if not all of the original variability.

♣ The R-square value is an indicator of how well the model fits the data.

♣ An R-square close to 1.0 indicates that we have accounted for almost all of the variability with the variables specified in the model.

N.B. a) The sources of variation are: i) due to regression and ii) residual (about regression).

b) The sum of squares due to regression (SSR) over the total sum of squares (TSS) is the proportion of the variability accounted for by the regression. Therefore, the percentage of variability accounted for, or explained, by the regression is 100 times this proportion.
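Continuing the NumPy sketch from earlier, the decomposition into regression and residual sums of squares, and hence R-square, can be computed directly (a worked illustration, not part of the handout):

```python
# Continues the NumPy sketch above (y, X_design and coef already defined).
y_hat = X_design @ coef                # fitted values
y_bar = y.mean()

TSS = ((y - y_bar) ** 2).sum()         # total sum of squares
SSR = ((y_hat - y_bar) ** 2).sum()     # sum of squares due to regression
SSE = ((y - y_hat) ** 2).sum()         # residual sum of squares (TSS = SSR + SSE)

r_square = SSR / TSS                   # equivalently 1 - SSE / TSS
print("R-square:", r_square)
print("Percentage of variability explained:", 100 * r_square)
```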

Interpreting the Correlation Coefficient R

Customarily, the degree to which two or more predictors (independent or X variables) are related to the dependent (Y) variable is expressed in the correlation coefficient R, which is the square root of R-square.

▪ In multiple regression, R assumes values between 0 and 1. This is because, when there are several predictors, no meaning can be given to the sign (direction) of the correlation.

The larger R is, the more closely correlated the predictor variables are with the outcome variable.

When R = 1, the variables are perfectly correlated in the sense that the outcome variable is an exact linear combination of the predictors.

When the outcome variable is not linearly related to any of the predictor variables, R will be very small, but usually not exactly zero.

Choice of the Number of Variables

Multiple regression is a seductive technique: "plug in" as many predictor variables as you can think of and usually at least a few of them will come out significant.

This is because one is capitalizing on chance when simply including as many variables as one can think of as predictors of some other variable of interest. This problem is compounded when, in addition, the number of observations is relatively low.

Intuitively, it is clear that one can hardly draw conclusions from an analysis of 100 questionnaire items based on 10 respondents.

Most authors recommend that one should have at least 10 to 20 times as many observations (cases, respondents) as one has variables; otherwise the estimates of the regression line will probably be unstable.

Sometimes we know in advance which variables we wish to include in a multiple regression model. Here it is straightforward to fit a regression model containing all of those variables. Variables that are not significant can be omitted and the analysis redone.

There is no hard rule about this, however. Sometimes it is desirable to keep a variable in a model because past experience shows that it is important.

In large samples the omission of non-significant variables will have little effect on the other regression coefficients.

Usually it makes sense to omit variables that do not contribute much to the model (P > 0.05).

The statistical significance of each variable in the multiple regression model is obtained simply by calculating the ratio of the regression coefficient to its standard error and relating this value to the t distribution with n - k - 1 degrees of freedom, where n is the sample size and k is the number of variables in the model.
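A rough sketch of that calculation (an illustration, not the handout's own code), continuing the NumPy example from earlier; scipy is assumed to be available for the t distribution.

```python
import numpy as np
from scipy import stats

# Continues the NumPy sketch above: y, X_design (intercept column included), coef.
n, p = X_design.shape                    # p = k + 1 (k predictors plus intercept)
df = n - p                               # n - k - 1 degrees of freedom

residuals = y - X_design @ coef
sigma2 = (residuals ** 2).sum() / df                    # residual variance estimate
cov = sigma2 * np.linalg.inv(X_design.T @ X_design)     # covariance matrix of coefficients
se = np.sqrt(np.diag(cov))                              # standard errors

t_values = coef / se                                    # coefficient / standard error
p_values = 2 * stats.t.sf(np.abs(t_values), df)         # two-sided P-values

names = ["a"] + [f"b{i}" for i in range(1, p)]
for name, t, pv in zip(names, t_values, p_values):
    print(f"{name}: t = {t:.3f}, P = {pv:.4f}")
```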

Stepwise regression

Stepwise regression is a technique for choosing predictor variables from a large set. The stepwise approach can be used with multiple linear, logistic and Cox regressions. There are two basic strategies for applying this technique, known as forward and backward stepwise regression.

Forward stepwise regression: The first step in many analyses of multivariate data is to examine the simple relation between each potential explanatory variable and the outcome variable of interest, ignoring all the other variables. Forward stepwise regression analysis uses this analysis as its starting point. The steps in applying this method (sketched in code below) are:

a) Find the single variable that has the strongest association with the dependent variable (i.e., the variable with the smallest p-value) and enter it into the model.

b) Find the variable among those not in the model that, when added to the model so far obtained, explains the largest amount of the remaining variability.

c) Repeat step (b) until the addition of an extra variable is not statistically significant at some chosen level, such as P = 0.05.

N.B. You have to stop the process at some point, otherwise you will end up with all the variables in the model.
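A compact sketch of this forward-selection procedure, using statsmodels OLS P-values as the entry criterion; the function name, the 0.05 cut-off default and the usage line are illustrative assumptions, and packaged stepwise routines may differ in detail.

```python
import statsmodels.api as sm

def forward_stepwise(data, outcome, candidates, alpha=0.05):
    """Enter the most significant remaining variable at each step; stop when none is significant."""
    selected, remaining = [], list(candidates)
    while remaining:
        # P-value of each candidate when added to the variables already in the model
        pvals = {}
        for var in remaining:
            X = sm.add_constant(data[selected + [var]])
            fit = sm.OLS(data[outcome], X).fit()
            pvals[var] = fit.pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:      # step (c): stop when the best addition is not significant
            break
        selected.append(best)         # steps (a)/(b): enter the variable with the smallest P-value
        remaining.remove(best)
    return selected

# Hypothetical usage with the birth-weight data given later in this handout:
# print(forward_stepwise(df, "X1", ["X2", "X3", "X4", "X5", "X6"]))
```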

Backward stepwise regression:

♣ As its name indicates, with the backward stepwise method we approach the problem from the other direction.

♣ The argument given is that we have collected data on these variables because we believe them to be potentially important explanatory variables. Therefore, we should fit the full model, including all of these variables, and then remove unimportant variables one at a time until all those remaining in the model contribute significantly.

♣ We use the same criterion, say P < 0.05, to determine significance. At each step we remove the variable with the smallest contribution to the model (i.e., the largest P-value), as long as that P-value is greater than the chosen level.
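A matching sketch of the backward procedure, under the same assumptions as the forward example above:

```python
import statsmodels.api as sm

def backward_stepwise(data, outcome, candidates, alpha=0.05):
    """Fit the full model, then drop the least significant variable until all remaining are significant."""
    selected = list(candidates)
    while selected:
        X = sm.add_constant(data[selected])
        fit = sm.OLS(data[outcome], X).fit()
        pvals = fit.pvalues.drop("const")     # ignore the intercept
        worst = pvals.idxmax()                # variable with the largest P-value
        if pvals[worst] <= alpha:             # every remaining variable contributes significantly
            break
        selected.remove(worst)
    return selected
```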

Multicollinearity

This is a common problem in many correlation analyses. Imagine that you have two predictors (X variables) of a person's height: (1) weight in pounds and (2) weight in ounces.

Obviously, our two predictors are completely redundant; weight is one and the same variable, regardless of whether it is measured in pounds or ounces.

Trying to decide which one of the two measures is a better predictor of height would be rather silly; however, this is exactly what one would try to do if one were to perform a multiple regression analysis with height as the dependent (Y) variable and the two measures of weight as the independent (X) variables.

When there are very many variables involved, it is often not immediately apparent that this problem exists, and it may only manifest itself after several variables have already been entered into the regression equation.

Nevertheless, when this problem occurs it means that at least one of the predictor variables is (practically) completely redundant with other predictors. There are many statistical indicators of this type of redundancy.
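One widely used indicator of such redundancy is the variance inflation factor (VIF); the handout does not name it, but a sketch using statsmodels is shown below, with placeholder column names.

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(predictors):
    """VIF for each column of a DataFrame of predictors; values above ~10 suggest serious redundancy."""
    X = sm.add_constant(predictors)
    return {col: variance_inflation_factor(X.values, i)
            for i, col in enumerate(X.columns) if col != "const"}

# Hypothetical usage: vif_table(df[["X2", "X3", "X4", "X5", "X6"]])
```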

The Importance of Residual Analysis

Even though most assumptions of multiple regression cannot be tested explicitly, gross violations can be detected and should be dealt with appropriately.

In particular, outliers (i.e., extreme cases) can seriously bias the results by "pulling" or "pushing" the regression line in a particular direction (see the example below), thereby leading to biased regression coefficients.

Often, excluding just a single extreme case can yield a completely different set of results.
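As one simple screening device (an assumption of this sketch, not a method prescribed by the handout), observations with unusually large standardized residuals can be flagged, continuing the earlier NumPy sketches:

```python
import numpy as np

# Continues the earlier NumPy sketches: y, X_design, coef and df (= n - k - 1).
residuals = y - X_design @ coef
std_resid = residuals / np.sqrt((residuals ** 2).sum() / df)

# Cases with |standardized residual| greater than about 2 deserve a closer look.
suspect = np.where(np.abs(std_resid) > 2)[0]
print("Possible outliers (row indices):", suspect)
```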

Example on multiple regression
The following data were taken from a survey of women
attending an antenatal clinic. The objectives of the study
were to identify the factors responsible for low birth weight
and to predict women 'at risk' of having a low birth weight
baby.
Notations: BW = Birth weight (kg) of the child = X1

HEIGHT = Height of mother (cm) = X2

AGEMOTH = Age of mother (years) = X3

AGEFATH = Age of father (years) = X4

FAMINC = Monthly family income (Birr) = X5

GESTAT = Period of gestation (days) = X6

Number X1 X2 X3 X4 X5 X6

1 3.6 170 32 30 800 280
2 2.5 156 35 40 300 255
3 3.0 166 32 40 700 265
4 3.2 168 32 32 900 275
5 2.8 162 28 25 400 268
6 4.0 172 29 27 1000 280
7 3.4 170 27 35 650 276
8 1.8 152 22 40 180 255
9 2.4 156 18 24 250 263
10 3.6 169 24 24 750 278

11 1.9 153 18 36 150 259


12 3.1 164 33 35 850 265
13 3.0 163 37 40 500 266
14 2.8 163 23 30 250 267
15 3.8 171 33 30 750 276
16 4.0 172 27 29 1500 280
17 2.5 163 21 31 148 260
18 3.3 166 35 37 800 268
19 2.2 161 21 28 350 260
20 3.8 170 30 32 900 277

21 3.2 168 32 40 450 275


22 2.2 160 23 28 300 264
23 2.8 163 25 26 350 270
24 1.8 156 17 23 400 256
25 3.8 171 36 42 750 277
26 3.1 167 30 45 500 269
27 3.4 173 30 32 700 276
28 2.9 167 28 29 350 272
29 3.0 168 27 33 550 272
30 2.6 161 20 30 270 267
31 2.4 161 18 26 200 260

32 3.4 169 29 35 750 275
33 3.8 175 32 45 780 276
34 2.8 167 26 40 330 266
35 3.0 168 36 42 350 275
36 3.3 169 38 50 400 276
37 2.5 160 19 48 250 259
38 1.8 156 17 27 150 253
39 3.6 175 31 40 1200 278
40 3.2 170 29 39 1000 274

41 4.0 172 29 27 1000 280


42 3.4 170 27 35 650 276
43 1.8 152 22 40 180 255
44 2.4 156 18 24 250 263
45 3.6 169 24 24 750 278
46 4.0 172 27 29 1500 280
47 2.5 163 21 31 148 260
48 3.3 166 35 37 800 275
49 2.2 161 21 28 350 264
50 3.8 170 30 32 900 277

51 3.1 167 30 45 500 269


52 3.4 173 30 32 700 276
53 2.9 167 28 29 350 272
54 3.0 168 27 33 550 272
55 2.6 161 20 30 270 264
56 3.3 169 38 50 400 266
57 2.5 160 19 48 250 259
58 1.8 156 17 27 150 253
59 3.6 175 31 40 1200 278
60 3.2 170 29 39 1000 274

Answer the following questions based on the above data.

a) Discuss the possible objectives of the above study.

b) What are the most likely reasons for conducting such studies?

c) Check the association of each predictor with the dependent variable.

d) Fit the full regression model.

e) Fit the condensed regression model.

f) What do you understand from your answers in parts c, d and e?

g) What is the proportion of variability accounted for by the regression?

h) Compute the multiple correlation coefficient.

i) Predict the birth weight of a baby born alive from a woman aged 30 years and with the following additional characteristics:

• height of mother = 170 cm
• age of father = 40 years
• monthly family income = 600 Birr
• period of gestation = 275 days

j) Estimate the birth weight of a baby born alive from a woman with the same characteristics as in (i) but with a mother's age of 49 years.

k) Write a short note on your findings.
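One possible way to carry out parts (d), (g), (h) and (i) in software is sketched below, assuming the 60 rows above have been entered into a pandas DataFrame named df with columns X1 ... X6; this is an illustration only, not the handout's own analysis.

```python
import pandas as pd
import statsmodels.api as sm

# Assumes the 60 rows above have been entered into a DataFrame `df`
# with columns X1 (birth weight) and X2 ... X6 (the predictors).
y = df["X1"]
X = sm.add_constant(df[["X2", "X3", "X4", "X5", "X6"]])

# d) Full regression model
full_model = sm.OLS(y, X).fit()
print(full_model.summary())               # coefficients, t values, P-values

# g) Proportion of variability accounted for, and h) multiple correlation coefficient
print("R-square:", full_model.rsquared)
print("Multiple R:", full_model.rsquared ** 0.5)

# i) Predicted birth weight for the stated characteristics
new = pd.DataFrame({"const": [1.0], "X2": [170.0], "X3": [30.0],
                    "X4": [40.0], "X5": [600.0], "X6": [275.0]})
print("Predicted birth weight (kg):", float(full_model.predict(new).iloc[0]))
```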
