Statistical Modelling of Epidemiological Data
11-Feb-21
What is a Model?
Statistical modelling is a quantitative assessment of the size of an effect. It is used, for example, as part of causal inference (does being black result in a lower salary?) when studying the effects of discrimination
The data need to measure an outcome variable and a predictor variable, often on a continuous scale
A range of relations can then be examined between a response variable on the vertical axis and predictor variables on the horizontal axis, for example: no confounding, or suppression of a relation
A model has a systematic part and a random part: the systematic part is the average relation between the response and the predictors, while the random part is the variation in the response after taking account of the systematic part
The figure above displays data for 16 respondents, and a straight line
through the points to represent the systematic relation between salary and
length of employment
The line represents fitted values; if you have 10 years' service, the line gives your predicted salary
The intercept gives the predicted value of the response when the predictor is zero
The slope gives the marginal change in the response variable for a unit change in the predictor
Intercept: the average value of the response when all the predictors are zero
Slope: there is one for each predictor; it summarizes the relation between that predictor and the response
Residual: the difference between the actual and fitted values based on the model
When only one predictor variable is used, the model is called a simple regression model
The term 'model' is used to denote the formal statistical formula, or
equation, that describes the relationship we believe exists between the
predictor and the outcome
For example:

Y = β0 + β1X1 + ε

More formally, this says that the mean value of the outcome for any value of the predictor variable is determined using a starting point, β0, when X1 has the value 0 and, for each unit increase in X1, the outcome changes by β1 units
β0 is usually referred to as the constant or the intercept term, whereas β1 is usually referred to as the regression coefficient
The ε component is called the error and reflects the fact that the relationship between X1 and Y is not exact.
We will assume that these errors are normally and independently distributed, with zero mean and variance σ².
We estimate these errors by residuals; these are the
difference between the observed (actual) value of the
observation and the value predicted by the model
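As a minimal sketch in Python (simulated data; the salary and years-of-service numbers are illustrative, not the figure's actual values), the intercept, slope, fitted values and residuals can be computed directly:

```python
import numpy as np

# Illustrative data: years of service (x) and salary (y) for 16 respondents.
rng = np.random.default_rng(42)
x = rng.uniform(0, 20, 16)
y = 20000 + 1500 * x + rng.normal(0, 3000, 16)

# Least-squares estimates of intercept (b0) and slope (b1)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

fitted = b0 + b1 * x        # fitted (predicted) values on the line
residuals = y - fitted      # actual minus fitted values

print(f"intercept={b0:.1f}, slope={b1:.1f}, mean residual={residuals.mean():.2e}")
```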
When using X-variables to predict Y in a regression model there is no necessary underlying assumption of causation; we might just be estimating predictive associations
Terms such as 'X affects Y' or 'the effect of X on Y is ...' should therefore be used with caution when interpreting the results of the models
The regression models used by epidemiologists mostly
contain more than one predictor variable
A regression model with more than one predictor variable is known as a multiple regression model, or multivariable model
With two predictor variables, the regression model could be written as:

Y = β0 + β1X1 + β2X2 + ε
The above multivariable model suggests that:
we can predict the value of the outcome Y knowing the baseline (intercept β0) and the values of the two independent (predictor) variables (i.e. X1 and X2).
The parameters β1 and β2 describe the direction and magnitude of the association of X1 and X2 with Y
There can be as many X-variables as needed, not just two
A major difference from the simple regression model is that in the above multivariable model, β1 is an estimate of the effect of X1 on Y after controlling for the effects of X2, and β2 is the estimated effect of X2 on Y after controlling for the effects of X1
As in simple regression, the model suggests that we cannot
predict Y exactly, so the random error term (ε) takes this into
account
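A minimal sketch of this adjustment (simulated data; all names and numbers are illustrative, not from the text): the estimated effect of X1 changes once X2 is controlled for.

```python
import numpy as np

# Simulate X1 correlated with X2, and Y depending on both.
rng = np.random.default_rng(1)
n = 200
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)           # X1 correlated with X2
y = 2.0 + 1.0 * x1 + 1.5 * x2 + rng.normal(size=n)

# Simple regression of Y on X1 alone (no adjustment for X2)
b1_crude = np.polyfit(x1, y, 1)[0]

# Multivariable fit: design matrix with intercept, X1 and X2
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"crude X1 effect: {b1_crude:.2f}; adjusted X1 effect: {beta[1]:.2f}")
```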
Selection of variables for a regression model
One can never be sure that there are not other
variables that were omitted from the model that also
affect Y and are related to one or more of the Xs
The ANOVA table
The idea behind using regression is that we believe that
information in X can be used to predict the value of Y
If we believe the X-variable contains information about the
Y-variable, we should be able to do a better job of
predicting the value of Y for a given X than if we did not
have that information
The formal way to approach this in regression is to
ascertain how much of the sums of squares of Y (the
numerator of the variance of y) we can explain with
knowledge of the X-variable(s)
This decomposition of the total sum of squares (SS) is shown in the second column of the next table (i.e. SST = SSM + SSE; also, dfT = dfM + dfE)
Source   SS                  df           MS
Model    SSM = Σ(Ŷi - Ȳ)²    k            MSM = SSM / k
Error    SSE = Σ(Yi - Ŷi)²   n - k - 1    MSE = SSE / (n - k - 1)
Total    SST = Σ(Yi - Ȳ)²    n - 1        MST = SST / (n - 1)
In the formulae in the table, Ȳ is the mean of the Ys, and k is the number of predictor variables in the model (not counting the intercept)
When the SS are divided by their degrees of freedom (df), the result is a mean square, here denoted as MSM (model), MSE (error) and MST (total)
The MSE is our estimate of the error variance and is therefore also denoted σ̂²
The sums of squares are partitioned by choosing values of the βs that minimise the SSE (or MSE); hence the name 'least squares regression'
The formula for doing this involves matrix algebra, but for the simple linear regression model the βs can be determined using:

β̂1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²,   β̂0 = ȳ - β̂1·x̄
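A minimal sketch (simulated data; all names are illustrative) that computes the least-squares estimates from these formulas and verifies the decomposition SST = SSM + SSE:

```python
import numpy as np

# Simulated data for a simple linear regression.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 2, 50)

# Least-squares estimates from the simple-regression formulas
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)        # total sum of squares
SSM = np.sum((fitted - y.mean()) ** 2)   # model (regression) sum of squares
SSE = np.sum((y - fitted) ** 2)          # error (residual) sum of squares

print(f"SST={SST:.1f}  SSM+SSE={SSM + SSE:.1f}")   # the decomposition holds
k, n = 1, len(y)
MSM, MSE = SSM / k, SSE / (n - k - 1)
print(f"F = MSM/MSE = {MSM / MSE:.1f}")
```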
For significance testing, the F-test of the ANOVA table is used: F = MSM / MSE, with k and n - (k + 1) degrees of freedom
Testing the significance of a regression
coefficient
A t-test with n - (k + 1) degrees of freedom (dfE) is used to evaluate the significance of any of the regression coefficients, e.g. the jth coefficient: t = β̂j / SE(β̂j)
The usual null hypothesis is H0: βj = 0, but any value β* other than 0 can be used in H0: βj = β*, depending on the context
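A hedged sketch of this test, with illustrative numbers for the coefficient, its standard error, n and k (none of them from the text):

```python
from scipy import stats

# Illustrative coefficient, standard error, sample size and predictor count.
b_j, se_j = 1.2, 0.4
n, k = 50, 3
df_e = n - (k + 1)                       # error degrees of freedom

t = (b_j - 0) / se_j                     # H0: beta_j = 0 (use beta* for other nulls)
p = 2 * stats.t.sf(abs(t), df_e)         # two-sided P-value
print(f"t = {t:.2f}, df = {df_e}, P = {p:.3f}")
```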
Interpreting R2 and adjusted R2
R2 is sometimes called the coefficient of determination of the model
It describes the amount of variance in the outcome variable that is
'explained' or 'accounted for’ by the predictor variables (see Example
above)
It also is the squared correlation coefficient between the predicted and
observed Y-values
R2 always increases as variables are added to a multiple regression model, which makes R2 useless for variable selection and potentially misleading
Hence, R2 can be adjusted for the number of variables in the equation
(k), and this adjusted value will tend to decline if the variables added
contain little additional information about the outcome
The formula for the adjusted R2 is: adjusted R2 = 1 - (MSE/MST)
Notice the similarity with the formula R2 = SSM/SST = 1 - (SSE/SST)
The adjusted R2 is also useful for comparing the relative predictive abilities of equations with different numbers of variables in them
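A small illustration of both formulas, using made-up sums of squares:

```python
# Illustrative sums of squares and sample/predictor counts (not from the text).
SST, SSE = 480.0, 120.0
n, k = 50, 3

r2 = 1 - SSE / SST                 # = SSM/SST
MSE = SSE / (n - k - 1)
MST = SST / (n - 1)
adj_r2 = 1 - MSE / MST             # penalised for the number of predictors k

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```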
In order to assess the impact of the set of variables, we
note the change in the error (residual) sum of squares
(SSE) before and after entering (or deleting) the set of
variables
Alternatively, the model sum of squares can be used
That is, note SSEfull with the variable set of interest in the model (the 'full model'), then remove the set of variables (e.g. Xj and Xj') and note SSEred (the 'reduced model')
If variables Xj and Xj' are important, then SSEfull < SSEred (and SSMfull > SSMred)
The F-test to assess a set of variables is:

F = [(SSEred - SSEfull) / q] / [SSEfull / dfEfull]

where q is the number of variables removed from the full model and dfEfull is the error df of the full model
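A sketch of this comparison with illustrative sums of squares (q variables dropped from the full model; numbers are made up):

```python
from scipy import stats

# Illustrative error sums of squares for full and reduced models.
SSE_full, SSE_red = 100.0, 130.0
n, k_full, q = 60, 5, 2
df_e_full = n - (k_full + 1)

F = ((SSE_red - SSE_full) / q) / (SSE_full / df_e_full)
p = stats.f.sf(F, q, df_e_full)
print(f"F = {F:.2f} on ({q}, {df_e_full}) df, P = {p:.3f}")
```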
Modelling highly correlated (collinear) variables
Multiple regression is used to adjust for correlations among predictor
variables in the model
But if the variables are too highly correlated (the estimated effect of
each variable generally depends on the other variables in the model)
then a number of problems might arise
On the one hand, this is the advantage of a multivariable analysis: variables are studied while taking the others into account, thereby avoiding duplication of effects
On the other hand, this means that the effect of any variable might change when other variables are added to or removed from the model
The first problem arising from highly correlated (or collinear) predictors
is that estimated effects will depend strongly on the other predictors
present in the model
As a result, it might be difficult to statistically select the 'important'
predictors from a larger group of predictors
These concerns are less serious when the purpose of the analysis is
prediction than when interpretation of causal effects is the objective
The standard errors of regression coefficients might become very
large in a highly collinear model and hence we become less certain of
the likely magnitude of the association
To avoid this, a single X-variable should not be perfectly correlated with another X-variable
If two (or more) variables are highly correlated (collinear, |r| > 0.8-0.9), it will be difficult to select between (among) them for inclusion in the regression equation (a screening sketch follows the list of remedies below)
When two variables are highly and positively correlated, the resulting
coefficients (βs) will be highly and negatively correlated.
The best way of eliminating collinearity problems is:
-through considered exclusion of one of the variables, or
-by making a new combination of the variables on substantive
grounds
-In extreme situations specialised regression approaches, such as
ridge regression, might be needed
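As an illustration of the screening step, a minimal Python sketch flags pairs of simulated predictors whose correlation exceeds 0.8 (all data and names are illustrative):

```python
import numpy as np

# Simulate three predictors, two of them nearly collinear.
rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)    # nearly collinear with x1
x3 = rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
r = np.corrcoef(X, rowvar=False)           # correlation matrix of the predictors

for i in range(3):
    for j in range(i + 1, 3):
        if abs(r[i, j]) > 0.8:
            print(f"x{i+1} and x{j+1} are highly correlated: r = {r[i, j]:.2f}")
```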
Detecting and modeling interaction
When building a regression model, we need to balance the desire to get the model which 'best fits' the data with the desire for parsimony (simplicity in the model)
The maximum model is the model with all possible predictors of interest included
However, adding a lot of predictors increases the chances of:
a) collinearity among predictor variables (if two or more independent variables are highly correlated)
b) including variables that are not important 'in the real world' but happen to be statistically significant in this dataset
On the other hand, if lactation number is suspected to be
an important confounder, it might be designated to
remain in the model regardless of whether or not it is
statistically significant
Building regression models using datasets with a large number of predictor variables is tricky
One rule of thumb suggests that there must be at least 10
observations for each predictor considered for inclusion in the
model
There are a variety of ways of reducing the number of variables
that need to be considered for inclusion in a regression model.
These include:
screening variables based on descriptive statistics
correlation analysis of independent variables
creation of indices
screening variables based on unconditional associations
principal components analysis/factor analysis
correspondence analysis
1. Screening variables based on descriptive statistics
Indices combine variables that are related into a single predictor that represents some overall level of a factor
These methods provide insight into how predictor variables are related to each other and, ultimately, into how groups of predictors are related to the outcome of interest
It is important to consider including interaction terms when specifying
the maximum model
There are five general strategies for creating and evaluating two-way interactions (a code sketch follows the list):
1. Create and evaluate all possible two-way interaction terms. This will only be feasible if the total number of predictors is small
2. Create two-way interactions among all predictors that are significant in the final main-effects model
3. Create two-way interactions among all predictors found to have a significant unconditional association with the outcome
4. Create two-way interactions only among pairs of variables which you suspect (based on evidence from the literature, etc.) might interact. This will probably focus on interactions involving the primary predictor(s) of interest and important confounders
5. Create two-way interactions that involve the exposure variable (predictor) of interest
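A minimal sketch for a single pair of predictors: the interaction term is simply the product of the two predictor columns (simulated data; all names and numbers are illustrative):

```python
import numpy as np

# Simulate an outcome with a genuine X1*X2 interaction.
rng = np.random.default_rng(3)
n = 150
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.5 * x1 + 0.8 * x2 + 1.2 * x1 * x2 + rng.normal(size=n)

# Design matrix: intercept, main effects, and the product (interaction) column.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"interaction coefficient estimate: {beta[3]:.2f}")   # near the true 1.2
```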
Once a maximum model has been specified, you need to
decide how you will determine which predictors need to be
retained in the model
Non-statistical considerations
Variables should be retained in the model if they meet any
of the following criteria.
They are a primary predictor of interest
They are known, a priori, to be potential confounders for
the primary predictor of interest.
They show evidence of being a confounder in this
dataset because their removal results in a substantial
change in the coefficient for one of the primary
predictors of interest
They are a component of an interaction term which is
included in the model
Statistical criteria relate to evaluating the statistical significance of individual predictors
Once the criteria (both statistical and non-statistical) to be used have been specified, predictors can be evaluated stepwise: the term with the largest partial F is added first, and the process is then repeated for the remaining terms
Some of the problems with automated model-building
procedures are that they:
yield R2 values which are too high
are based on methods (e.g. partial F-tests) which were
designed to test specific hypotheses in the data (as
opposed to evaluating all possible relationships) so
they produce P-values which are too small and
confidence intervals for parameters which are too
narrow (more on this below)
can have severe problems in the face of collinearity
cannot incorporate any of the non-statistical
considerations identified above
make the predictive ability of the model look better than it really is
waste a lot of paper
In the process, investigators must incorporate their
biological knowledge of the system being studied
along with the results of the statistical analyses
The first step is to thoroughly evaluate the model using
regression 'diagnostics' (e.g. evaluating the normality of
residuals from a linear regression model)
This assesses the validity of the model and
procedures for doing this are described in each
chapter describing specific model types
The second step is to evaluate the reliability of the model
That is, to address the question 'how well will the model predict observations in subsequent samples?'
Or how well can the conclusions from a regression model be generalised, i.e. used to make future predictions?
The two most common approaches to assessing
reliability are
split-sample analysis and
leave-one-out analysis
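A minimal sketch of leave-one-out analysis for a simple linear regression, using simulated data (all numbers are illustrative):

```python
import numpy as np

# Simulated data for a simple linear regression.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 40)
y = 2 + 1.5 * x + rng.normal(0, 2, 40)

loo_errors = []
for i in range(len(x)):
    mask = np.arange(len(x)) != i                 # drop observation i
    b1, b0 = np.polyfit(x[mask], y[mask], 1)      # refit without it
    loo_errors.append(y[i] - (b0 + b1 * x[i]))    # error on the held-out point

print(f"LOO RMSE = {np.sqrt(np.mean(np.square(loo_errors))):.2f}")
```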
• Now suppose the dependent variable y is binary
We can't use linear regression techniques to analyse these data: with a binary outcome the variance will also vary with the level of X and, consequently, the assumptions of linear regression (normality and constant variance of the errors) are violated
• This figure shows that while the logit of p might become very large or very small, p does not go beyond the bounds of 0 and 1
• In fact, logit values tend to remain between -7 and +7, as these are associated with very small (<0.001) and very large (>0.999) probabilities, respectively
This transformation leads to the logistic model, in which the probability of the outcome can be expressed in one of the two following ways (they are equivalent):

logit(p) = ln(p / (1 - p)) = β0 + β1X1

p = e^(β0 + β1X1) / (1 + e^(β0 + β1X1))
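A small sketch of the transformation and its inverse (plain NumPy; the function names are our own):

```python
import numpy as np

def logit(p):
    """Log odds of a probability p."""
    return np.log(p / (1 - p))

def expit(z):
    """Inverse logit: maps any real z back to (0, 1)."""
    return np.exp(z) / (1 + np.exp(z))

p = 0.999
print(logit(p))            # about +6.9, near the +7 bound noted above
print(expit(logit(p)))     # recovers 0.999
```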
Odds and odds ratios
From this, we can compute the odds of disease, i.e. p/(1 - p). To simplify calculating the odds of disease:

p / (1 - p) = e^(β0 + β1X1)
From this it is a relatively simple process to determine the odds ratio (OR) for disease that is associated with the presence of factor X:

OR = odds(X = 1) / odds(X = 0) = e^(β0 + β1) / e^(β0) = e^(β1)
In this situation, the maximum value of the lnL can be determined
directly, but in many cases an iterative approach is required.
The Logistic Regression Model

The ratio p / (1 - p) is called the odds. This quantity will also increase with the value of x, ranging from zero to infinity.

The quantity ln(p / (1 - p)) is called the log odds (logit).
Suppose a die is rolled:
Success = "roll a six", p = 1/6

The odds = p / (1 - p) = (1/6) / (5/6) = 1/5
The model assumes the log odds is linearly related to x, i.e.:

ln(p / (1 - p)) = b0 + b1x

In terms of the odds:

p / (1 - p) = e^(b0 + b1x)
The Logistic Regression Model

Solving for p: p + p·e^(b0 + b1x) = e^(b0 + b1x), so

p = e^(b0 + b1x) / (1 + e^(b0 + b1x))
[Figure: the logistic curve p = e^(b0 + b1x) / (1 + e^(b0 + b1x)) plotted for x from 0 to 10; at x = 0, p = e^(b0) / (1 + e^(b0)).]
[Figure: the logistic curve reaches p = 1/2 where b0 + b1x = 0, i.e. at x = -b0/b1.]
The slope of the curve:

dp/dx = d/dx [ e^(b0 + b1x) / (1 + e^(b0 + b1x)) ]
      = b1·e^(b0 + b1x) / (1 + e^(b0 + b1x))²
      = b1/4  when x = -b0/b1
[Figure: the logistic curve with slope = b1/4 at its midpoint x = -b0/b1.]
The data will, for each case, consist of a value of x and a value of y (0 or 1):
case  x    y
1     0.8  0
2     2.3  1
3     2.5  0
4     2.8  1
5     3.5  1
6     4.4  1
7     0.5  0
8     4.5  1
9     4.4  1
10    0.9  0
11    3.3  1
12    1.1  0
13    2.5  1
14    0.3  1
15    4.5  1
16    1.8  0
17    2.4  1
18    1.6  0
19    1.9  1
20    4.6  1
...
230   4.7  1
231   0.3  0
232   1.4  0
233   4.5  1
234   1.4  1
235   4.5  1
236   3.9  0
237   0.0  0
238   4.3  1
239   1.0  0
240   3.9  1
241   1.1  0
242   3.4  1
243   0.6  0
244   1.6  0
245   3.9  0
246   0.2  0
247   2.5  0
248   4.1  1
249   4.2  1
250   4.9  1
Estimation of the parameter
Rather than starting with the observed data and computing parameter
estimates (as is done with least squares estimates), one determines
the likelihood (probability) of the observed data for various
combinations of parameter values
The set of parameter values that is most likely to have produced the observed data gives the maximum likelihood (ML) estimates
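A minimal sketch of ML estimation for the single-predictor logistic model, maximising the log-likelihood numerically over simulated data (the true values b0 = -2 and b1 = 1 were chosen to roughly match the output shown later; nothing here is the original dataset):

```python
import numpy as np
from scipy.optimize import minimize

# Simulate 250 cases from a logistic model with b0 = -2, b1 = 1.
rng = np.random.default_rng(11)
x = rng.uniform(0, 5, 250)
p_true = 1 / (1 + np.exp(-(-2 + 1.0 * x)))
y = rng.binomial(1, p_true)

def neg_log_lik(beta):
    z = beta[0] + beta[1] * x
    # lnL = sum[y*z - ln(1 + e^z)], negated and written in a stable form
    return np.sum(np.logaddexp(0, z) - y * z)

res = minimize(neg_log_lik, x0=[0.0, 0.0])   # iterative maximisation
print(f"ML estimates: b0 = {res.x[0]:.3f}, b1 = {res.x[1]:.3f}")
```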
Open the data file:
Choose from the menu:
Analyze -> Regression -> Binary Logistic
The following dialogue box appears
Here is the output
          b        SE
X         1.0309   0.1334
Constant  -2.0475  0.332

i.e. b1 = 1.0309 and b0 = -2.0475
At the intercept (x = 0):

p = e^(b0) / (1 + e^(b0)) = e^(-2.0475) / (1 + e^(-2.0475)) = 0.1143
Another interpretation of the parameter b1: b1/4 is the rate of increase in p with respect to x when p = 0.50

b1/4 = 1.0309 / 4 = 0.258
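These two interpretations can be checked directly from the fitted coefficients:

```python
import numpy as np

# Fitted coefficients from the output above.
b0, b1 = -2.0475, 1.0309

p_at_0 = np.exp(b0) / (1 + np.exp(b0))    # probability when x = 0
slope_mid = b1 / 4                         # rate of change in p when p = 0.5

print(f"p at x=0: {p_at_0:.4f}")           # 0.1143, as above
print(f"slope at p=0.5: {slope_mid:.3f}")  # 0.258, as above
```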
The Multiple Logistic Regression Model

ln(p / (1 - p)) = b0 + b1X1 + ... + bpXp

or p = e^(b0 + b1X1 + ... + bpXp) / (1 + e^(b0 + b1X1 + ... + bpXp))
In this example we are interested in determining the risk of infants developing bronchopulmonary dysplasia (BPD)
For n = 223 infants in a prenatal ward, the following measurements were determined:
1. X1 = gestational age (weeks)
2. X2 = birth weight (grams)
3. Y = presence of BPD
case  Gestational Age (weeks)  Birth weight (g)  Presence of BPD
1 28.6 1119 1
2 31.5 1222 0
3 30.3 1311 1
4 28.9 1082 0
5 30.3 1269 0
6 30.5 1289 0
7 28.5 1147 0
8 27.9 1136 1
9 30 972 0
10 31 1252 0
11 27.4 818 0
12 29.4 1275 0
13 30.8 1231 0
14 30.4 1112 0
15 31.1 1353 1
16 26.7 1067 1
17 27.4 846 1
18 28 1013 0
19 29.3 1055 0
20 30.4 1226 0
21 30.2 1237 0
22 30.2 1287 0
23 30.1 1215 0
24 27 929 1
25 30.3 1159 0
26 27.4 1046 1
Variables in the Equation

ln(p / (1 - p)) = 16.858 - 0.003·BW - 0.505·GA

p / (1 - p) = e^(16.858 - 0.003·BW - 0.505·GA)

p = e^(16.858 - 0.003·BW - 0.505·GA) / (1 + e^(16.858 - 0.003·BW - 0.505·GA))
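A sketch of prediction from this equation. Note: the signs of the BW and GA coefficients were reconstructed from the garbled source on the assumption that BPD risk falls as birth weight and gestational age increase.

```python
import numpy as np

def p_bpd(bw, ga):
    """Predicted probability of BPD from the fitted equation above
    (coefficient signs assumed, as noted in the text)."""
    z = 16.858 - 0.003 * bw - 0.505 * ga
    return np.exp(z) / (1 + np.exp(z))

# e.g. an infant with birth weight 1000 g at 28 weeks' gestation
print(f"{p_bpd(1000, 28):.3f}")
```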
[Figure: predicted probability of BPD versus birth weight (700-1700 g), with separate curves for GA = 27, 28, 29, 30, 31 and 32 weeks.]