Generalised Linear Models


Model… modelling…
• What is a model?
• A model is just a simple abstraction of reality in that it provides an
approximation of some relatively more complex phenomenon.
• Models: deterministic or probabilistic.
• Deterministic: the system outcomes and responses are precisely
defined, often by a set of equations.
• Give an example of a deterministic model.
Model… modelling… (2)
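• Probabilistic models: the system outcomes or responses exhibit
variability, because the model either contains random elements or is
affected in some way by random forces. A common form is the
regression model:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon $$

• $y$ is the outcome (response variable).
• $x_1, x_2, \ldots, x_k$ are a set of predictors or regressor variables.
• $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are a set of unknown parameters.
• $\varepsilon$ is the random error term.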
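
The difference between the deterministic part and the probabilistic model can be seen in a small simulation. This is only a sketch with one predictor, assuming the usual form y = beta0 + beta1*x + error with a normal error; the parameter values and the names `mean_response` and `observe` are hypothetical and chosen for illustration.

```python
import random

# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, SIGMA = 2.0, 0.5, 1.0


def mean_response(x):
    """Deterministic part: the mean of y at a given x."""
    return BETA0 + BETA1 * x


def observe(x, rng):
    """Probabilistic model: mean response plus a normal random error."""
    return mean_response(x) + rng.gauss(0.0, SIGMA)


rng = random.Random(0)  # fixed seed so the run is repeatable
x = 4.0
draws = [observe(x, rng) for _ in range(5)]

print("mean response at x:", mean_response(x))
print("five observations at the same x:", draws)
```

The mean response is the same every time, while repeated observations at the same x differ because of the random error term.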
Linear models…
• The previous equation is called a linear model because the mean
response is a linear function of the unknown parameters
$\beta_0, \beta_1, \beta_2, \ldots, \beta_k$.

• Linear regression models are used widely for several reasons. First,
they are natural approximating polynomials for more complex
functional relationships.
Linear models, drawbacks and disadvantages
• Because linear regression models are used so often (and so
successfully) as approximating polynomials, we sometimes refer to
them as empirical models.
• Second, it is straightforward to estimate the unknown parameters
$\beta_0, \beta_1, \beta_2, \ldots, \beta_k$.
• Finally, there is an elegant and well-developed statistical theory
for the linear model.
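
As an illustration of how direct the estimation is, here is a minimal sketch of least squares for the one-predictor case, using the standard closed-form formulas. The function name `least_squares_line` and the data are made up for this example.

```python
def least_squares_line(x, y):
    """Closed-form least-squares estimates (b0, b1) for y = b0 + b1*x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sums of squares and cross-products.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx          # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


# Made-up data that lie exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = least_squares_line(x, y)
print("intercept:", b0, "slope:", b1)
```

With more than one predictor the same idea is expressed in matrix form, but the estimates are still obtained in a single step, without iteration.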
Assumptions….
• If we assume that the errors (𝜀) in the linear model are normally and
independently distributed with constant variance, then statistical
tests on the model parameters, confidence intervals on the
parameters, and confidence and prediction intervals for the mean
response can easily be obtained.
Nonlinear models
• Linear regression models often arise as empirical models for more
complex, and generally unknown phenomena.
• However, there are situations where the phenomenon is well
understood and can be described by a mathematical relationship.
• For example: Newton’s cooling law.
Nonlinear model: Newton’s cooling law:
• Newton's law of cooling states that the rate of change of
temperature of an object is proportional to the difference between
the object's current temperature and the temperature of the
surrounding environment.
• Thus, if $f$ is the current temperature and $T_A$ is the ambient or
environmental temperature, then

$$ \frac{df}{dt} = -\beta (f - T_A) $$

where $\beta$ is the constant of proportionality.


• The value of 𝛽 depends on the thermal conductivity of the object and
other factors.
Newton’s cooling law(2)
• The actual temperature of the object at time $t$ is the solution to the
previous equation:

$$ f(t, \beta) = T_A + (T_1 - T_A) e^{-\beta t} $$

where $T_1$ is the initial temperature of the object.


• In practice, a person measures the temperature at time t with an
instrument, and both the person and the instrument are potential
sources of variability not accounted for in the previous equation.
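
The solution of the cooling equation can be checked with a short derivation. This sketch assumes the law is written as df/dt = -beta (f - T_A) with beta > 0 and initial condition f(0) = T_1; the substitution variable u is introduced here only for the derivation.

```latex
% Cooling law with its initial condition
\frac{df}{dt} = -\beta\,(f - T_A), \qquad f(0) = T_1

% Substitute u = f - T_A; T_A is constant, so du/dt = df/dt
\frac{du}{dt} = -\beta\,u \;\Longrightarrow\; u(t) = u(0)\,e^{-\beta t}

% Use u(0) = T_1 - T_A and return to f
f(t) = T_A + (T_1 - T_A)\,e^{-\beta t}
```

As t grows the exponential term vanishes, so the temperature approaches the ambient value T_A, as expected.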
Newton’s cooling law(3)
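• This is an example of a nonlinear model. Adding a random error term
$\varepsilon$ for the measurement variability, the observed temperature is

$$ y = T_A + (T_1 - T_A) e^{-\beta t} + \varepsilon $$

• The response is not a linear function of the unknown parameter $\beta$.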


• Nonlinear models are sometimes called mechanistic models.
• Many nonlinear models are developed directly from the solution to
differential equations, as was illustrated in the Newton’s cooling law.
• There is a statistical theory supporting inference for the nonlinear
model. This theory makes use of the normal distribution, and typically
assumes that observations are independent with constant variance.
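
Unlike the linear model, a nonlinear model usually has to be fitted iteratively. The following is a minimal sketch, not a definitive implementation, of Gauss-Newton least squares for the cooling-law parameter. It assumes the ambient and initial temperatures are known, and it uses simulated measurements; all numerical values and the names `cooling` and `fit_beta` are hypothetical.

```python
import math
import random

# Hypothetical known temperatures and "true" cooling constant.
T_A, T_1 = 20.0, 90.0
TRUE_BETA = 0.10


def cooling(t, beta):
    """Temperature at time t under Newton's law of cooling."""
    return T_A + (T_1 - T_A) * math.exp(-beta * t)


def fit_beta(times, temps, beta=0.05, n_iter=50):
    """Gauss-Newton iterations for the single parameter beta."""
    for _ in range(n_iter):
        num = den = 0.0
        for t, y in zip(times, temps):
            resid = y - cooling(t, beta)
            # Derivative of the model with respect to beta.
            deriv = -(T_1 - T_A) * t * math.exp(-beta * t)
            num += deriv * resid
            den += deriv * deriv
        step = num / den
        beta += step
        if abs(step) < 1e-12:
            break
    return beta


rng = random.Random(1)  # fixed seed so the run is repeatable
times = [float(t) for t in range(0, 31, 2)]
# Simulated measurements: model value plus normal measurement error.
temps = [cooling(t, TRUE_BETA) + rng.gauss(0.0, 0.5) for t in times]

beta_hat = fit_beta(times, temps)
print("estimated beta:", beta_hat)
```

The estimate depends on a starting value and on repeated updates, which is the practical price of a model that is nonlinear in its parameter.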
Generalised linear models (GLM)
• When dealing with the linear and nonlinear regression models
discussed above, the normal distribution played a central role.
• Inference procedures for both linear and nonlinear regression models
in fact assume that the response variable y follows the normal
distribution.
• There are a lot of practical situations where this assumption is not
going to be even approximately satisfied.
• Suppose that the response variable is a discrete variable, such as a
count.
Generalised linear models (GLM)(2)
• We often encounter counts of defects or other rare events, such as
injuries, patients with particular diseases, and even the occurrence
of natural phenomena including earthquakes and Atlantic hurricanes.
• Another possibility is a binary response variable.
• Situations where the response variable is either success or failure
(i.e., 0 or 1 ) are fairly common in nearly all areas of science and
engineering.
Challenger accident…
• The space shuttle was made up of the Challenger orbiter, an external
liquid fuel tank containing liquid hydrogen fuel and liquid oxygen
oxidizer, and two solid rocket boosters.
• The cause of the accident was eventually traced to the failure of
O-rings on the solid rocket booster.
Challenger accident…
• The O-rings failed because they lost flexibility at low temperatures,
and the temperature that morning was 31 °F, far below the lowest
temperature recorded for previous launches.
Challenger accident… (2)
• There does appear to be some relationship between failure and
temperature, with a higher likelihood of failure at lower
temperatures, but it is not immediately obvious what kind of model
might describe this relationship.
GLM, assumptions and distributions
• A linear regression model does not seem appropriate, because there
are likely some temperatures for which the fitted or predicted value
of failure would either be greater than unity or less than zero, clearly
impossible values.
• This is a situation where some type of generalized linear model is
more appropriate than an ordinary linear regression model.
• There are also situations where the response variable is continuous,
but the assumption of normality is not reasonable.
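
The problem with fitting a straight line to a 0/1 response can be seen with a few numbers. This is a sketch only: the temperatures and failure indicators below are made up for illustration and are not the Challenger data, and `least_squares_line` is a hypothetical helper.

```python
def least_squares_line(x, y):
    """Ordinary least-squares intercept and slope for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Made-up binary data: 1 = failure, 0 = no failure.
temps = [50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0]
failed = [1, 1, 1, 0, 1, 0, 0]

b0, b1 = least_squares_line(temps, failed)

# Fitted "probabilities" from the straight line at a cold and a hot temperature.
pred_cold = b0 + b1 * 31.0
pred_hot = b0 + b1 * 100.0
print("fitted value at 31 F:", pred_cold)   # above 1
print("fitted value at 100 F:", pred_hot)   # below 0
```

Both fitted values fall outside the 0-1 range, so they cannot be read as probabilities of failure.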
GLM, assumptions and distributions(2)
• The generalized linear model (GLM) allows us to fit regression models
for univariate response data that follow a distribution from a very
general family called the exponential family.
• The exponential family includes:
Normal
Binomial
Poisson
Geometric
Negative binomial
Exponential
Gamma
Inverse normal (also called inverse Gaussian)
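
For reference, one common way of writing an exponential-family density is sketched below, with the Poisson distribution as a worked case. The notation (theta, phi, a, b, c) is a conventional parameterisation assumed here; it is not taken from the slides.

```latex
% General exponential-family density (one common parameterisation)
f(y;\theta,\phi) = \exp\!\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi) \right\}

% Poisson distribution with mean \mu, rewritten in that form
f(y;\mu) = \frac{e^{-\mu}\mu^{y}}{y!} = \exp\{\, y\ln\mu - \mu - \ln y! \,\}

% so that \theta = \ln\mu, \quad b(\theta) = e^{\theta}, \quad a(\phi) = 1, \quad c(y,\phi) = -\ln y!
```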
GLM, components
• If the $y_i$, $i = 1, 2, \ldots, n$, represent the response values, then
the GLM is

$$ g(\mu_i) = g[E(y_i)] = x_i'\beta $$

where $x_i$ is a vector of regressor variables or covariates for the i-th
observation and $\beta$ is the vector of parameters or regression
coefficients.

• Every generalized linear model has three components:


1) Response variable distribution (error structure)
2) A linear predictor that involves the regressor variables or covariates
3) Link function g that connects the linear predictor to the natural mean of
the response variable.
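
The second and third components can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the dictionary `LINKS`, the function names, and the coefficient and regressor values are all hypothetical, and the response distribution (the first component) is not modelled here.

```python
import math

# Each entry holds a link g (mean -> linear-predictor scale) and its inverse.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
}


def linear_predictor(x, beta):
    """eta = beta[0] + beta[1]*x[0] + ... (intercept stored first)."""
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))


def mean_from_predictor(x, beta, link):
    """Mean response implied by the linear predictor under the chosen link."""
    _, g_inverse = LINKS[link]
    return g_inverse(linear_predictor(x, beta))


beta = [0.5, -0.2]  # hypothetical coefficients
x = [3.0]           # hypothetical single regressor value

for name in LINKS:
    print(name, "link gives mean", mean_from_predictor(x, beta, name))
```

The same linear predictor gives a different mean under each link: unrestricted for the identity link, positive for the log link, and between 0 and 1 for the logit link.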
GLM, components
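• Consider the linear regression model. The response distribution is
normal, the linear predictor is

$$ x'\beta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k $$

• and the link function is the identity link, $g(a) = a$, or

$$ E(y) = \mu = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k $$

• Thus, the standard linear regression model is a GLM.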


GLM, example given.
• Depending on the choice of the link function $g$, a GLM can include a
nonlinear model. For example, if we use a log link, $g(a) = \ln(a)$, then

$$ E(y) = \mu = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k) $$

• For the case of a binomial distribution, a fairly standard choice of link
function is the logit link.
• For the Challenger data, where there is a single regressor variable,
this leads to the model:

$$ E(y) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} $$
GLM, example given.
• We will later see that the estimates of the model parameters $\beta_0$
and $\beta_1$ in the previous equation are $b_0 = 10.875$ and
$b_1 = -0.17132$. Therefore the fitted function is:

$$ \hat{y} = \frac{\exp(10.875 - 0.17132x)}{1 + \exp(10.875 - 0.17132x)} $$

• This is called a logistic regression model, and it is a very common way
to model binomial response data.
• Notice that the model will not result in fitted values outside the 0-1
range, regardless of the value of temperature.
• The generalized linear model may be viewed as a unification of
linear and nonlinear regression models that incorporates a rich
family of normal and nonnormal response distributions.
• Model fitting and inference can all be performed under the same
framework.
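
The fitted logistic function for the Challenger data can be evaluated directly. The sketch below uses the estimates quoted above (10.875 and -0.17132) with the logit link; the function name `failure_probability` and the temperature grid are assumptions made for this example.

```python
import math

# Parameter estimates quoted in the text for the Challenger data.
B0, B1 = 10.875, -0.17132


def failure_probability(temp_f):
    """Fitted logistic function; algebraically equal to exp(eta) / (1 + exp(eta))."""
    eta = B0 + B1 * temp_f
    return 1.0 / (1.0 + math.exp(-eta))


# Evaluate over a wide, arbitrary range of temperatures (degrees Fahrenheit).
grid = list(range(-50, 151))
probs = [failure_probability(t) for t in grid]

print("fitted probability at 31 F:", failure_probability(31.0))
print("smallest fitted value:", min(probs))
print("largest fitted value:", max(probs))
```

At 31 °F the fitted probability of failure is roughly 0.996, and every value on the grid stays strictly between 0 and 1, decreasing as temperature rises.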
GLM
• While the earliest use of GLMs was confined to the life sciences and
the biopharmaceutical industries, applications to other areas of science
and engineering have been growing rapidly.
• The usual GLM assumes that the observations are independent. There are
situations where this assumption is inappropriate; examples include data
where there are multiple measurements on the same subject or
experimental unit, split-plot and other types of experiments that have
restrictions on randomization, and experiments involving both random and
fixed factors.
• Generalized estimating equations (GEEs) are introduced to account for a
correlation structure between observations in the generalized linear
model.
Example of GLM
• The Transistor Gain Data: the data describe a study in which the
transistor gain between emitter and collector in an integrated circuit
device (hFE) is reported along with two variables that can be
controlled at the deposition process: emitter drive-in time ($x_1$, in
minutes) and emitter dose ($x_2$, in ions $\times 10^{14}$).
• Fit a GLM and estimate the standard errors of the parameters. Does the
model show lack of fit? Fit another model.
