Statistical Modelling of Epidemiological Data
11-Feb-21
What is a Model?
Statistical modelling is a quantitative assessment of the size of an effect. It is used, for example, as part of causal inference (does being black result in a lower salary?) when studying the effects of discrimination
The data need to measure an outcome variable and a predictor variable, often on a continuous scale
A range of relations can then be examined between a response variable on the vertical axis and predictor variables on the horizontal axis, for example: no confounding, or suppression of a relation
A model has a systematic part and a random part: the systematic part is the average relation between the response and the predictors, while the random part is the variation in the response after taking account of the systematic part
The figure above displays data for 16 respondents, and a straight line
through the points to represent the systematic relation between salary and
length of employment
The line represents fitted values; if you have 10 years' service, the line gives your predicted salary
The intercept gives the predicted value of the response when the predictor is zero
The slope gives the marginal change in the response variable for a unit change in the predictor
Intercept: the average value of the response when all the predictors are zero
Slope: there is one for each predictor; it summarizes the relation between that predictor and the response
Residual: the difference between the actual and fitted values based on the model
When only one predictor variable is used, the model is called a simple regression model
The term 'model' is used to denote the formal statistical formula, or
equation, that describes the relationship we believe exists between the
predictor and the outcome
For example:

Y = β0 + β1X1 + ε

More formally, this says that the mean value of the outcome for any value of the predictor variable is determined using a starting point, β0, when X1 has the value 0 and, for each unit increase in X1, the outcome changes by β1 units
β0 is usually referred to as the constant or the intercept term, whereas β1 is usually referred to as the regression coefficient
The ε component is called the error and reflects the fact that the relationship between X1 and Y is not exact.
We will assume that these errors are normally and independently distributed, with zero mean and variance σ².
We estimate these errors by residuals; these are the
difference between the observed (actual) value of the
observation and the value predicted by the model
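As a minimal sketch in Python (simulated data; the salary and years-of-service numbers are illustrative, not the figure's actual values), the intercept, slope, fitted values and residuals can be computed directly:

```python
import numpy as np

# Illustrative data: years of service (x) and salary (y) for 16 respondents.
rng = np.random.default_rng(42)
x = rng.uniform(0, 20, 16)
y = 20000 + 1500 * x + rng.normal(0, 3000, 16)

# Least-squares estimates of intercept (b0) and slope (b1)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

fitted = b0 + b1 * x        # fitted (predicted) values on the line
residuals = y - fitted      # actual minus fitted values

print(f"intercept={b0:.1f}, slope={b1:.1f}, mean residual={residuals.mean():.2e}")
```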
When using X-variables to predict Y in a regression model there is no necessary underlying assumption of causation; we might just be estimating predictive associations
Terms such as 'X affects Y' or 'the effect of X on Y is ...' should therefore be used with caution when interpreting the results of the models
The regression models used by epidemiologists mostly
contain more than one predictor variable
A regression model with more than one predictor variable is known as a multiple regression model, or multivariable model
With two predictor variables, the regression model could be written as:

Y = β0 + β1X1 + β2X2 + ε
The above multivariable model suggests that:
we can predict the value of the outcome Y knowing the baseline (intercept β0) and the values of the two independent (predictor) variables (i.e. X1 and X2).
The parameters β1 and β2 describe the direction and magnitude of the association of X1 and X2 with Y
There can be as many X-variables as needed, not just two
A major difference from the simple regression model is that in the above multivariable model, β1 is an estimate of the effect of X1 on Y after controlling for the effects of X2, and β2 is the estimated effect of X2 on Y after controlling for the effects of X1
As in simple regression, the model suggests that we cannot
predict Y exactly, so the random error term (ε) takes this into
account
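A minimal sketch of this adjustment (simulated data; all names and numbers are illustrative, not from the text): the estimated effect of X1 changes once X2 is controlled for.

```python
import numpy as np

# Simulate X1 correlated with X2, and Y depending on both.
rng = np.random.default_rng(1)
n = 200
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)           # X1 correlated with X2
y = 2.0 + 1.0 * x1 + 1.5 * x2 + rng.normal(size=n)

# Simple regression of Y on X1 alone (no adjustment for X2)
b1_crude = np.polyfit(x1, y, 1)[0]

# Multivariable fit: design matrix with intercept, X1 and X2
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"crude X1 effect: {b1_crude:.2f}; adjusted X1 effect: {beta[1]:.2f}")
```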
Selection of variables for a regression model
One can never be sure that there are not other
variables that were omitted from the model that also
affect Y and are related to one or more of the Xs
The ANOVA table
The idea behind using regression is that we believe that
information in X can be used to predict the value of Y
If we believe the X-variable contains information about the
Y-variable, we should be able to do a better job of
predicting the value of Y for a given X than if we did not
have that information
The formal way to approach this in regression is to
ascertain how much of the sums of squares of Y (the
numerator of the variance of y) we can explain with
knowledge of the X-variable(s)
This decomposition of the total sum of squares (SS) is shown in the second column of the next table (i.e. SST = SSM + SSE; also, dfT = dfM + dfE)
Source   SS                  df           MS
Model    SSM = Σ(Ŷi - Ȳ)²    k            MSM = SSM / k
Error    SSE = Σ(Yi - Ŷi)²   n - k - 1    MSE = SSE / (n - k - 1)
Total    SST = Σ(Yi - Ȳ)²    n - 1        MST = SST / (n - 1)
In the formulae in the table, Ȳ is the mean of the Ys, and k is the number of predictor variables in the model (not counting the intercept)
When the SS are divided by their degrees of freedom (df), the result is a mean square, here denoted as MSM (model), MSE (error) and MST (total)
The MSE is our estimate of the error variance and is therefore also denoted σ̂²
The sums of squares are partitioned by choosing values of the βs that minimise the SSE (or MSE); hence the name 'least squares regression'
The formula for doing this involves matrix algebra, but for the simple linear regression model the βs can be determined using:

β̂1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²,   β̂0 = ȳ - β̂1·x̄
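A minimal sketch (simulated data; all names are illustrative) that computes the least-squares estimates from these formulas and verifies the decomposition SST = SSM + SSE:

```python
import numpy as np

# Simulated data for a simple linear regression.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 2, 50)

# Least-squares estimates from the simple-regression formulas
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)        # total sum of squares
SSM = np.sum((fitted - y.mean()) ** 2)   # model (regression) sum of squares
SSE = np.sum((y - fitted) ** 2)          # error (residual) sum of squares

print(f"SST={SST:.1f}  SSM+SSE={SSM + SSE:.1f}")   # the decomposition holds
k, n = 1, len(y)
MSM, MSE = SSM / k, SSE / (n - k - 1)
print(f"F = MSM/MSE = {MSM / MSE:.1f}")
```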
For significance testing, the F-test of the ANOVA table is used: F = MSM / MSE, with k and n - (k + 1) degrees of freedom
Testing the significance of a regression
coefficient
A t-test with n - (k + 1) degrees of freedom (dfE) is used to evaluate the significance of any of the regression coefficients, e.g. the jth coefficient: t = β̂j / SE(β̂j)
The usual null hypothesis is H0: βj = 0, but any value β* other than 0 can be used in H0: βj = β*, depending on the context
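A hedged sketch of this test, with illustrative numbers for the coefficient, its standard error, n and k (none of them from the text):

```python
from scipy import stats

# Illustrative coefficient, standard error, sample size and predictor count.
b_j, se_j = 1.2, 0.4
n, k = 50, 3
df_e = n - (k + 1)                       # error degrees of freedom

t = (b_j - 0) / se_j                     # H0: beta_j = 0 (use beta* for other nulls)
p = 2 * stats.t.sf(abs(t), df_e)         # two-sided P-value
print(f"t = {t:.2f}, df = {df_e}, P = {p:.3f}")
```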
Interpreting R2 and adjusted R2
R2 is sometimes called the coefficient of determination of the model
It describes the amount of variance in the outcome variable that is
'explained' or 'accounted for’ by the predictor variables (see Example
above)
It also is the squared correlation coefficient between the predicted and
observed Y-values
R2 always increases as variables are added to a multiple regression model, which makes R2 useless for variable selection and potentially misleading
Hence, R2 can be adjusted for the number of variables in the equation
(k), and this adjusted value will tend to decline if the variables added
contain little additional information about the outcome
The formula for the adjusted R2 is: adjusted R2 = 1 - (MSE/MST)
Notice the similarity with the formula R2 = SSM/SST = 1 - (SSE/SST)
The adjusted R2 is also useful for comparing the relative predictive abilities of equations with different numbers of variables in them
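A small illustration of both formulas, using made-up sums of squares:

```python
# Illustrative sums of squares and sample/predictor counts (not from the text).
SST, SSE = 480.0, 120.0
n, k = 50, 3

r2 = 1 - SSE / SST                 # = SSM/SST
MSE = SSE / (n - k - 1)
MST = SST / (n - 1)
adj_r2 = 1 - MSE / MST             # penalised for the number of predictors k

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```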
In order to assess the impact of the set of variables, we
note the change in the error (residual) sum of squares
(SSE) before and after entering (or deleting) the set of
variables
Alternatively, the model sum of squares can be used
That is, note SSEfull with the variable set of interest in the model (the 'full model'), then remove the set of variables (e.g. Xj and Xj') and note SSEred (the 'reduced model')
If variables Xj and Xj' are important, then SSEfull < SSEred (and SSMfull > SSMred)
The F-test to assess a set of variables is:

F = [(SSEred - SSEfull) / q] / [SSEfull / dfEfull]

where q is the number of variables removed from the full model and dfEfull is the error df of the full model
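A sketch of this comparison with illustrative sums of squares (q variables dropped from the full model; numbers are made up):

```python
from scipy import stats

# Illustrative error sums of squares for full and reduced models.
SSE_full, SSE_red = 100.0, 130.0
n, k_full, q = 60, 5, 2
df_e_full = n - (k_full + 1)

F = ((SSE_red - SSE_full) / q) / (SSE_full / df_e_full)
p = stats.f.sf(F, q, df_e_full)
print(f"F = {F:.2f} on ({q}, {df_e_full}) df, P = {p:.3f}")
```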
Modelling highly correlated (collinear) variables
Multiple regression is used to adjust for correlations among predictor
variables in the model
But if the variables are too highly correlated (the estimated effect of
each variable generally depends on the other variables in the model)
then a number of problems might arise
On the one hand, this is the advantage of a multivariable analysis: variables are studied while taking the others into account, thereby avoiding duplication of effects
On the other hand, this means that the effect of any variable might change when other variables are added to or removed from the model
The first problem arising from highly correlated (or collinear) predictors
is that estimated effects will depend strongly on the other predictors
present in the model
As a result, it might be difficult to statistically select the 'important'
predictors from a larger group of predictors
These concerns are less serious when the purpose of the analysis is
prediction than when interpretation of causal effects is the objective
The standard errors of regression coefficients might become very
large in a highly collinear model and hence we become less certain of
the likely magnitude of the association
To avoid this, a single X-variable should not be perfectly correlated with another X-variable
If two (or more) variables are highly correlated (collinear, |r| > 0.8-0.9), it will be difficult to select between (among) them for inclusion in the regression equation (a screening sketch follows the list of remedies below)
When two variables are highly and positively correlated, the resulting
coefficients (βs) will be highly and negatively correlated.
The best way of eliminating collinearity problems is:
-through considered exclusion of one of the variables, or
-by making a new combination of the variables on substantive
grounds
-In extreme situations specialised regression approaches, such as
ridge regression, might be needed
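As an illustration of the screening step, a minimal Python sketch flags pairs of simulated predictors whose correlation exceeds 0.8 (all data and names are illustrative):

```python
import numpy as np

# Simulate three predictors, two of them nearly collinear.
rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)    # nearly collinear with x1
x3 = rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
r = np.corrcoef(X, rowvar=False)           # correlation matrix of the predictors

for i in range(3):
    for j in range(i + 1, 3):
        if abs(r[i, j]) > 0.8:
            print(f"x{i+1} and x{j+1} are highly correlated: r = {r[i, j]:.2f}")
```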
Detecting and modeling interaction
When building a regression model, we need to balance the desire to get the model which 'best fits' the data with the desire for parsimony (simplicity in the model)
The maximum model is the model with all possible predictors of interest included
However, adding a lot of predictors increases the chances of:
a) collinearity among predictor variables (if two or more independent variables are highly correlated)
b) including variables that are not important 'in the real world' but happen to be statistically significant in this dataset
On the other hand, if lactation number is suspected to be
an important confounder, it might be designated to
remain in the model regardless of whether or not it is
statistically significant
Building regression models using datasets with a large number of predictor variables is tricky
One rule of thumb suggests that there must be at least 10
observations for each predictor considered for inclusion in the
model
There are a variety of ways of reducing the number of variables
that need to be considered for inclusion in a regression model.
These include:
screening variables based on descriptive statistics
correlation analysis of independent variables
creation of indices
screening variables based on unconditional associations
principal components analysis/factor analysis
correspondence analysis
1. Screening variables based on descriptive statistics
Indices combine variables that are related into a single predictor that represents some overall level of a factor
These methods provide insight into how predictor variables are related to each other and, ultimately, into how groups of predictors are related to the outcome of interest
It is important to consider including interaction terms when specifying
the maximum model
There are five general strategies for creating and evaluating two-way interactions (a code sketch follows the list):
1. Create and evaluate all possible two-way interaction terms. This will only be feasible if the total number of predictors is small
2. Create two-way interactions among all predictors that are significant in the final main-effects model
3. Create two-way interactions among all predictors found to have a significant unconditional association with the outcome
4. Create two-way interactions only among pairs of variables which you suspect (based on evidence from the literature, etc.) might interact. This will probably focus on interactions involving the primary predictor(s) of interest and important confounders
5. Create two-way interactions that involve the exposure variable (predictor) of interest
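A minimal sketch for a single pair of predictors: the interaction term is simply the product of the two predictor columns (simulated data; all names and numbers are illustrative):

```python
import numpy as np

# Simulate an outcome with a genuine X1*X2 interaction.
rng = np.random.default_rng(3)
n = 150
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.5 * x1 + 0.8 * x2 + 1.2 * x1 * x2 + rng.normal(size=n)

# Design matrix: intercept, main effects, and the product (interaction) column.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"interaction coefficient estimate: {beta[3]:.2f}")   # near the true 1.2
```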
Once a maximum model has been specified, you need to
decide how you will determine which predictors need to be
retained in the model
Non-statistical considerations
Variables should be retained in the model if they meet any
of the following criteria.
They are a primary predictor of interest
They are known, a priori, to be potential confounders for
the primary predictor of interest.
They show evidence of being a confounder in this
dataset because their removal results in a substantial
change in the coefficient for one of the primary
predictors of interest
They are a component of an interaction term which is
included in the model
Statistical criteria relate to evaluating the statistical significance of individual predictors
Once the criteria (both statistical and non-statistical) to be used have been specified, predictors can be evaluated stepwise: the term with the largest partial F is added first, and the process is then repeated for the remaining terms
Some of the problems with automated model-building
procedures are that they:
yield R2 values which are too high
are based on methods (e.g. partial F-tests) which were
designed to test specific hypotheses in the data (as
opposed to evaluating all possible relationships) so
they produce P-values which are too small and
confidence intervals for parameters which are too
narrow (more on this below)
can have severe problems in the face of collinearity
cannot incorporate any of the non-statistical
considerations identified above
make the predictive ability of the model look better than it really is
waste a lot of paper
In the process, investigators must incorporate their
biological knowledge of the system being studied
along with the results of the statistical analyses
The first step is to thoroughly evaluate the model using
regression 'diagnostics' (e.g. evaluating the normality of
residuals from a linear regression model)
This assesses the validity of the model and
procedures for doing this are described in each
chapter describing specific model types
The second step is to evaluate the reliability of the model
That is, to address the question 'how well will the model predict observations in subsequent samples?'
Or how well can the conclusions from a regression model be generalised, i.e. used to make future predictions?
The two most common approaches to assessing
reliability are
split-sample analysis and
leave-one-out analysis
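A minimal sketch of leave-one-out analysis for a simple linear regression, using simulated data (all numbers are illustrative):

```python
import numpy as np

# Simulated data for a simple linear regression.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 40)
y = 2 + 1.5 * x + rng.normal(0, 2, 40)

loo_errors = []
for i in range(len(x)):
    mask = np.arange(len(x)) != i                 # drop observation i
    b1, b0 = np.polyfit(x[mask], y[mask], 1)      # refit without it
    loo_errors.append(y[i] - (b0 + b1 * x[i]))    # error on the held-out point

print(f"LOO RMSE = {np.sqrt(np.mean(np.square(loo_errors))):.2f}")
```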
• Now suppose the dependent variable y is binary
We can't use linear regression techniques to analyse these data: with a binary outcome the variance will also vary with the level of X and, consequently, the assumptions of linear regression (normality and constant variance of the errors) are violated
• This figure shows that while the logit of p might become very large or very small, p does not go beyond the bounds of 0 and 1
• In fact, logit values tend to remain between -7 and +7, as these are associated with very small (<0.001) and very large (>0.999) probabilities, respectively
This transformation leads to the logistic model, in which the probability of the outcome can be expressed in one of the two following ways (they are equivalent):

logit(p) = ln(p / (1 - p)) = β0 + β1X1

p = e^(β0 + β1X1) / (1 + e^(β0 + β1X1))
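A small sketch of the transformation and its inverse (plain NumPy; the function names are our own):

```python
import numpy as np

def logit(p):
    """Log odds of a probability p."""
    return np.log(p / (1 - p))

def expit(z):
    """Inverse logit: maps any real z back to (0, 1)."""
    return np.exp(z) / (1 + np.exp(z))

p = 0.999
print(logit(p))            # about +6.9, near the +7 bound noted above
print(expit(logit(p)))     # recovers 0.999
```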
Odds and odds ratios
From this, we can compute the odds of disease, i.e. p/(1 - p). To simplify calculating the odds of disease:

p / (1 - p) = e^(β0 + β1X1)
From this it is a relatively simple process to determine the odds ratio (OR) for disease that is associated with the presence of factor X:

OR = odds(X = 1) / odds(X = 0) = e^(β0 + β1) / e^(β0) = e^(β1)
In this situation, the maximum value of the lnL can be determined
directly, but in many cases an iterative approach is required.
The Logistic Regression Model

The ratio p / (1 - p) is called the odds. This quantity will also increase with the value of x, ranging from zero to infinity.

The quantity ln(p / (1 - p)) is called the log odds (logit).
Suppose a die is rolled:
Success = "roll a six", p = 1/6

The odds = p / (1 - p) = (1/6) / (5/6) = 1/5
The model assumes the log odds is linearly related to x, i.e.:

ln(p / (1 - p)) = b0 + b1x

In terms of the odds:

p / (1 - p) = e^(b0 + b1x)
The Logistic Regression Model

Solving for p: p + p·e^(b0 + b1x) = e^(b0 + b1x), so

p = e^(b0 + b1x) / (1 + e^(b0 + b1x))
[Figure: the logistic curve p = e^(b0 + b1x) / (1 + e^(b0 + b1x)) plotted for x from 0 to 10; at x = 0, p = e^(b0) / (1 + e^(b0)).]
[Figure: the logistic curve reaches p = 1/2 where b0 + b1x = 0, i.e. at x = -b0/b1.]
The slope of the curve:

dp/dx = d/dx [ e^(b0 + b1x) / (1 + e^(b0 + b1x)) ]
      = b1·e^(b0 + b1x) / (1 + e^(b0 + b1x))²
      = b1/4  when x = -b0/b1
[Figure: the logistic curve with slope = b1/4 at its midpoint x = -b0/b1.]
The data will, for each case, consist of a value of x and a value of y (0 or 1):
case  x    y
1     0.8  0
2     2.3  1
3     2.5  0
4     2.8  1
5     3.5  1
6     4.4  1
7     0.5  0
8     4.5  1
9     4.4  1
10    0.9  0
11    3.3  1
12    1.1  0
13    2.5  1
14    0.3  1
15    4.5  1
16    1.8  0
17    2.4  1
18    1.6  0
19    1.9  1
20    4.6  1
...
230   4.7  1
231   0.3  0
232   1.4  0
233   4.5  1
234   1.4  1
235   4.5  1
236   3.9  0
237   0.0  0
238   4.3  1
239   1.0  0
240   3.9  1
241   1.1  0
242   3.4  1
243   0.6  0
244   1.6  0
245   3.9  0
246   0.2  0
247   2.5  0
248   4.1  1
249   4.2  1
250   4.9  1
Estimation of the parameter
Rather than starting with the observed data and computing parameter
estimates (as is done with least squares estimates), one determines
the likelihood (probability) of the observed data for various
combinations of parameter values
The set of parameter values that is most likely to have produced the observed data gives the maximum likelihood (ML) estimates
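A minimal sketch of ML estimation for the single-predictor logistic model, maximising the log-likelihood numerically over simulated data (the true values b0 = -2 and b1 = 1 were chosen to roughly match the output shown later; nothing here is the original dataset):

```python
import numpy as np
from scipy.optimize import minimize

# Simulate 250 cases from a logistic model with b0 = -2, b1 = 1.
rng = np.random.default_rng(11)
x = rng.uniform(0, 5, 250)
p_true = 1 / (1 + np.exp(-(-2 + 1.0 * x)))
y = rng.binomial(1, p_true)

def neg_log_lik(beta):
    z = beta[0] + beta[1] * x
    # lnL = sum[y*z - ln(1 + e^z)], negated and written in a stable form
    return np.sum(np.logaddexp(0, z) - y * z)

res = minimize(neg_log_lik, x0=[0.0, 0.0])   # iterative maximisation
print(f"ML estimates: b0 = {res.x[0]:.3f}, b1 = {res.x[1]:.3f}")
```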
Open the data file:
Choose from the menu:
Analyze -> Regression -> Binary Logistic
The following dialogue box appears
Here is the output
          b        SE
X         1.0309   0.1334
Constant  -2.0475  0.332

i.e. b1 = 1.0309 and b0 = -2.0475
At the intercept (x = 0):

p = e^(b0) / (1 + e^(b0)) = e^(-2.0475) / (1 + e^(-2.0475)) = 0.1143
Another interpretation of the parameter b1: b1/4 is the rate of increase in p with respect to x when p = 0.50

b1/4 = 1.0309 / 4 = 0.258
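These two interpretations can be checked directly from the fitted coefficients:

```python
import numpy as np

# Fitted coefficients from the output above.
b0, b1 = -2.0475, 1.0309

p_at_0 = np.exp(b0) / (1 + np.exp(b0))    # probability when x = 0
slope_mid = b1 / 4                         # rate of change in p when p = 0.5

print(f"p at x=0: {p_at_0:.4f}")           # 0.1143, as above
print(f"slope at p=0.5: {slope_mid:.3f}")  # 0.258, as above
```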
The Multiple Logistic Regression Model

ln(p / (1 - p)) = b0 + b1X1 + ... + bpXp

or p = e^(b0 + b1X1 + ... + bpXp) / (1 + e^(b0 + b1X1 + ... + bpXp))
In this example we are interested in determining the risk of infants developing bronchopulmonary dysplasia (BPD)
For n = 223 infants in a prenatal ward, the following measurements were determined:
1. X1 = gestational age (weeks)
2. X2 = birth weight (grams)
3. Y = presence of BPD
case  Gestational Age (weeks)  Birth weight (g)  Presence of BPD
1 28.6 1119 1
2 31.5 1222 0
3 30.3 1311 1
4 28.9 1082 0
5 30.3 1269 0
6 30.5 1289 0
7 28.5 1147 0
8 27.9 1136 1
9 30 972 0
10 31 1252 0
11 27.4 818 0
12 29.4 1275 0
13 30.8 1231 0
14 30.4 1112 0
15 31.1 1353 1
16 26.7 1067 1
17 27.4 846 1
18 28 1013 0
19 29.3 1055 0
20 30.4 1226 0
21 30.2 1237 0
22 30.2 1287 0
23 30.1 1215 0
24 27 929 1
25 30.3 1159 0
26 27.4 1046 1
Variables in the Equation

ln(p / (1 - p)) = 16.858 - 0.003·BW - 0.505·GA

p / (1 - p) = e^(16.858 - 0.003·BW - 0.505·GA)

p = e^(16.858 - 0.003·BW - 0.505·GA) / (1 + e^(16.858 - 0.003·BW - 0.505·GA))
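A sketch of prediction from this equation. Note: the signs of the BW and GA coefficients were reconstructed from the garbled source on the assumption that BPD risk falls as birth weight and gestational age increase.

```python
import numpy as np

def p_bpd(bw, ga):
    """Predicted probability of BPD from the fitted equation above
    (coefficient signs assumed, as noted in the text)."""
    z = 16.858 - 0.003 * bw - 0.505 * ga
    return np.exp(z) / (1 + np.exp(z))

# e.g. an infant with birth weight 1000 g at 28 weeks' gestation
print(f"{p_bpd(1000, 28):.3f}")
```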
[Figure: predicted probability of BPD versus birth weight (700-1700 g), with separate curves for GA = 27, 28, 29, 30, 31 and 32 weeks.]