Big Data Statistics, meeting 2: Other regression models or generalized linear models

8 February 2024
Binomial regression (cont’d)
■ If we want to keep the approach of relating $p \in (0, 1)$ to the linear predictor $\sum_{i=1}^d \beta_i x_i$, we have basically two equivalent possibilities:

a) We use a function $h$ that maps $\sum_{i=1}^d \beta_i x_i$ to the interval $(0, 1)$;
b) Or we apply a function $g$ to $p$ and model $g(p)$ by $\sum_{i=1}^d \beta_i x_i$.
■ As said they are equivalent. Which one do you find more convenient/natural?
■ I find possibility a) more natural but most of the literature presents the model with
possibility b).
■ Note that $h\left(\sum_{i=1}^d \beta_i x_i\right) = p$ and $g(p) = \sum_{i=1}^d \beta_i x_i$ imply $h = g^{-1}$.
■ Functions with the property in a) are called response functions and functions with
the property in b) are called link functions. The names given to these functions in a
particular application are based on possibility b).
■ Let us look at some examples of response and link functions (the names always
stem from the link function):
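As a hedged illustration, the sketch below pairs three commonly used link functions (logit, probit, complementary log-log) with their response functions and checks numerically that $h = g^{-1}$. The choice of these three links is an assumption made for this sketch, and the names `LINKS` and `round_trip_error` are hypothetical helpers, not part of the slides.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the probit pair

# name -> (link g: (0,1) -> R, response h = g^{-1}: R -> (0,1))
LINKS = {
    "logit": (
        lambda p: math.log(p / (1.0 - p)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
    "probit": (
        _STD_NORMAL.inv_cdf,
        _STD_NORMAL.cdf,
    ),
    "cloglog": (
        lambda p: math.log(-math.log(1.0 - p)),
        lambda eta: 1.0 - math.exp(-math.exp(eta)),
    ),
}


def round_trip_error(name, p):
    """Return |h(g(p)) - p| for the named link; this should be close to 0."""
    g, h = LINKS[name]
    return abs(h(g(p)) - p)


for name in LINKS:
    worst = max(round_trip_error(name, p) for p in (0.1, 0.5, 0.9))
    print(name, worst)
```

Each response function maps any real linear predictor into (0, 1), which is exactly the property required in possibility a).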
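Binomial regress.: estimation (cont’d)
\[
\begin{aligned}
f(\beta_1, \dots, \beta_d)
&= \Phi\Big(\sum_{j=1}^d \beta_j x_{1j}\Big)^{y_1} \Big(1 - \Phi\Big(\sum_{j=1}^d \beta_j x_{1j}\Big)\Big)^{1-y_1} \\
&\quad \times \cdots \times \Phi\Big(\sum_{j=1}^d \beta_j x_{nj}\Big)^{y_n} \Big(1 - \Phi\Big(\sum_{j=1}^d \beta_j x_{nj}\Big)\Big)^{1-y_n} \\
&= \prod_{i=1}^n \Phi\Big(\sum_{j=1}^d \beta_j x_{ij}\Big)^{y_i} \Big(1 - \Phi\Big(\sum_{j=1}^d \beta_j x_{ij}\Big)\Big)^{1-y_i},
\end{aligned}
\]

where $y_i \in \{0, 1\}$, $i = 1, \dots, n$.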

■ This might look scary at first glance but it is just the 'conventional' likelihood for
independent $Z_i \sim \mathrm{Bernoulli}(p_i)$, $1 \le i \le n$, which equals $\prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1-y_i}$,
where we replaced $p_i$ by $\Phi\big(\sum_{j=1}^d \beta_j x_{ij}\big)$.

Binomial regress.: estimation (cont’d)
In compact form the log likelihood equals
\[
\log f(\beta_1, \dots, \beta_d)
= \sum_{i=1}^n y_i \log \Phi\Big(\sum_{j=1}^d \beta_j x_{ij}\Big)
+ \sum_{i=1}^n (1 - y_i) \log\Big(1 - \Phi\Big(\sum_{j=1}^d \beta_j x_{ij}\Big)\Big).
\]
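The product form and the log form of the probit likelihood can be compared numerically. The following is a minimal sketch using only the Python standard library; the toy data `X`, `y` and the coefficient vector `beta` are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf, the probit response function


def linear_predictor(beta, x_row):
    """Return sum_j beta_j * x_ij for one observation."""
    return sum(b * x for b, x in zip(beta, x_row))


def probit_likelihood(beta, X, y):
    """Product form: prod_i Phi(eta_i)^y_i * (1 - Phi(eta_i))^(1 - y_i)."""
    value = 1.0
    for x_row, y_i in zip(X, y):
        p_i = PHI(linear_predictor(beta, x_row))
        value *= p_i ** y_i * (1.0 - p_i) ** (1 - y_i)
    return value


def probit_log_likelihood(beta, X, y):
    """Sum form: sum_i y_i log Phi(eta_i) + (1 - y_i) log(1 - Phi(eta_i))."""
    total = 0.0
    for x_row, y_i in zip(X, y):
        p_i = PHI(linear_predictor(beta, x_row))
        total += y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i)
    return total


# Hypothetical toy data: first column is an intercept.
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 0.5], [1.0, 2.0]]
y = [0, 1, 0, 1]
beta = [0.2, 0.7]

print(math.log(probit_likelihood(beta, X, y)), probit_log_likelihood(beta, X, y))
```

For larger n the log form is the one to use in practice, since the product of many probabilities underflows quickly.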

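Differentiating w.r.t. $\beta_k$, $k = 1, \dots, d$, we have (with $\varphi = \Phi'$)
\[
\frac{\partial \log f(\beta_1, \dots, \beta_d)}{\partial \beta_k}
= \sum_{i=1}^n \frac{y_i}{\Phi\big(\sum_{j=1}^d \beta_j x_{ij}\big)} \, \varphi\Big(\sum_{j=1}^d \beta_j x_{ij}\Big) x_{ik}
+ \sum_{i=1}^n \frac{1 - y_i}{1 - \Phi\big(\sum_{j=1}^d \beta_j x_{ij}\big)} \Big(-\varphi\Big(\sum_{j=1}^d \beta_j x_{ij}\Big) x_{ik}\Big).
\]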
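A derivative formula like this one is easy to get wrong by a sign, so a finite-difference check is a useful habit. The sketch below implements the partial derivatives of the probit log likelihood and compares them with central differences; the data, the coefficient vector and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_N = NormalDist()  # cdf is Phi, pdf is phi = Phi'


def eta(beta, x_row):
    return sum(b * x for b, x in zip(beta, x_row))


def log_lik(beta, X, y):
    total = 0.0
    for x_row, y_i in zip(X, y):
        p_i = _N.cdf(eta(beta, x_row))
        total += y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i)
    return total


def score(beta, X, y):
    """Partial derivatives of the log likelihood w.r.t. beta_1, ..., beta_d."""
    grad = [0.0] * len(beta)
    for x_row, y_i in zip(X, y):
        e = eta(beta, x_row)
        p_i, dens = _N.cdf(e), _N.pdf(e)
        # y_i / Phi * phi  +  (1 - y_i) / (1 - Phi) * (-phi)
        factor = y_i / p_i * dens - (1 - y_i) / (1.0 - p_i) * dens
        for k, x_ik in enumerate(x_row):
            grad[k] += factor * x_ik
    return grad


def finite_difference(beta, X, y, k, step=1e-6):
    """Central difference approximation of the k-th partial derivative."""
    up, down = list(beta), list(beta)
    up[k] += step
    down[k] -= step
    return (log_lik(up, X, y) - log_lik(down, X, y)) / (2.0 * step)


X = [[1.0, -1.0], [1.0, 0.0], [1.0, 0.5], [1.0, 2.0]]  # hypothetical data
y = [0, 1, 0, 1]
beta = [0.2, 0.7]

analytic = score(beta, X, y)
numeric = [finite_difference(beta, X, y, k) for k in range(len(beta))]
print(analytic, numeric)
```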

Generalized linear models (cont’d)
The three elements of a generalized linear model are:
1. The distribution of the response variable $Y$ has a probability density function (pdf) or
probability mass function (pmf) $f_\theta^Y$, $\theta \in \Theta$, of the type
\[
f_\theta^Y(y) = \exp\left(\frac{y\theta - b(\theta)}{\psi} - c(\psi, y)\right), \quad y \in D,
\]
where $\theta \in \Theta$ and $\psi$ are real-valued parameters, $D$ is the support of the distribution
of $Y$, and $b$ and $c$ are real-valued functions.
2. A linear function of the explanatory variable $x$, i.e. $\sum_{j=1}^d \beta_j x_j$;
3. A function $h$ that links $\sum_{j=1}^d \beta_j x_j$ (the linear predictor) and the expected value $E[Y]$
of the response.
Two often used properties that follow from the above form of the pmf or pdf are
(i) $E[Y] = b'(\theta)$ ($b'$ denotes the first derivative of $b$); and
(ii) $\mathrm{Var}[Y] = \psi b''(\theta)$ ($b''$ denotes the second derivative of $b$).
For the rest we assume that $\psi$, which is called the dispersion parameter, is known.
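Properties (i) and (ii) can be checked numerically for a concrete family. The sketch below uses the Poisson distribution as an illustration, assuming the standard parametrization $\theta = \log\lambda$, $b(\theta) = \exp(\theta)$, $\psi = 1$ and $c(\psi, y) = \log(y!)$; this parametrization is supplied here as an assumption and is not spelled out on the slide. The function name `glm_pmf` is hypothetical.

```python
import math


def glm_pmf(y, theta, b, c, psi=1.0):
    """Evaluate exp((y * theta - b(theta)) / psi - c(psi, y))."""
    return math.exp((y * theta - b(theta)) / psi - c(psi, y))


def b(theta):
    # For the Poisson case b(theta) = exp(theta), so b' = b'' = exp as well.
    return math.exp(theta)


def c(psi, y):
    # log(y!) written via the log-gamma function; psi is unused when psi = 1.
    return math.lgamma(y + 1)


lam = 3.0
theta = math.log(lam)
support = range(200)  # truncation of D = {0, 1, 2, ...}; the tail is negligible

probs = [glm_pmf(y, theta, b, c) for y in support]
total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

# Compare with b'(theta) and psi * b''(theta), both equal to exp(theta) here.
print(total, mean, variance, math.exp(theta))
```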
Generalized linear models (cont’d)
Example (Bernoulli as GLM): Let us check that the binomial regression model
belongs to the generalized linear models class.
1. The pmf of a Bernoulli random variable $Y$ is
\[
p^y (1 - p)^{1-y}, \quad y \in \{0, 1\}.
\]
We can rewrite this as
\[
\exp\big(\log(p) \cdot y + \log(1 - p) \cdot (1 - y)\big) = \exp\big(\log(p/(1 - p)) \cdot y + \log(1 - p)\big). \tag{3}
\]
This is exactly of the form on the previous slide with
\[
\theta = \log(p/(1 - p)), \quad \psi = 1, \quad c(\psi, y) = 0, \quad b(\theta) = -\log\big(1/(1 + \exp(\theta))\big) = \log(1 + \exp(\theta)).
\]
Note that $b$ is a function of $\theta$; to find it here we rewrote $p$ in terms of $\theta$ and plugged this into $\log(1 - p)$.
2. This element is always the same (we only need regressors which are used as input
for a linear function).

Generalized linear models (cont’d)
Example (Bernoulli as GLM (cont’d)): Finally, let us look at the third element of
GLMs for the Bernoulli distribution.
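3. The expectation of the random variable $Y$, if its probability mass function is given
in the form (3), equals
\[
E[Y] = \frac{\exp(\theta)}{1 + \exp(\theta)}.
\]
Linking this expectation to the linear function $\sum_{j=1}^d \beta_j x_{ij}$ by using $h$, it reads as
\[
h\Big(\sum_{j=1}^d \beta_j x_{ij}\Big) = \frac{\exp(\theta)}{1 + \exp(\theta)}.
\]
Taking $h$ to be the response function of the probit link, $h = \Phi$, this becomes
\[
\Phi\Big(\sum_{j=1}^d \beta_j x_{ij}\Big) = \frac{\exp(\theta)}{1 + \exp(\theta)}.
\]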
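The Bernoulli example can be verified numerically: with $b(\theta) = \log(1 + \exp(\theta))$, the derivative $b'(\theta)$ should equal $\exp(\theta)/(1 + \exp(\theta))$, and solving the probit equation for $\theta$ gives $\theta = \log(\Phi(\eta)/(1 - \Phi(\eta)))$ for a linear predictor value $\eta$. The following is a minimal sketch; the function names and the test values are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # probit response function h


def b(theta):
    """b(theta) = log(1 + exp(theta)) for the Bernoulli distribution."""
    return math.log1p(math.exp(theta))


def b_prime(theta):
    """E[Y] = b'(theta) = exp(theta) / (1 + exp(theta))."""
    return math.exp(theta) / (1.0 + math.exp(theta))


def b_prime_inverse(mu):
    """Inverse of b': maps a mean in (0, 1) back to theta (the log odds)."""
    return math.log(mu / (1.0 - mu))


def theta_from_eta(eta_value, h=PHI):
    """Solve h(eta) = b'(theta) for theta."""
    return b_prime_inverse(h(eta_value))


def numeric_derivative(f, x, step=1e-5):
    return (f(x + step) - f(x - step)) / (2.0 * step)


theta = 0.4
p = b_prime(theta)
print(numeric_derivative(b, theta), p)                  # first derivative vs. mean
print(numeric_derivative(b_prime, theta), p * (1 - p))  # second derivative vs. variance

eta_value = 0.3
print(b_prime(theta_from_eta(eta_value)), PHI(eta_value))
```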

Distribution theory GLMs
Above we derived the maximum likelihood estimators for a binomial and Poisson
regression. Here we look at their asymptotic properties by means of the theory for
generalized linear models.
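■ Let $Y_1, \dots, Y_n$ be independent, each with covariate vector $X_i = (X_{i1}, \dots, X_{id})$
and each with pmf or pdf given by
\[
\exp\left(\frac{y_i \theta_i - b(\theta_i)}{\psi} - c(\psi, y_i)\right), \quad y_i \in D,
\]
where $\theta_i$ is related to $\sum_{j=1}^d \beta_j x_{ij}$ by $\theta_i = (b')^{-1}\big(h\big(\sum_{j=1}^d \beta_j x_{ij}\big)\big)$.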
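■ The likelihood then equals
\[
\prod_{i=1}^n \exp\left(\frac{y_i \, (b')^{-1}\big(h\big(\sum_{j=1}^d \beta_j x_{ij}\big)\big) - b\big((b')^{-1}\big(h\big(\sum_{j=1}^d \beta_j x_{ij}\big)\big)\big)}{\psi} - c(\psi, y_i)\right).
\]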

■ The MLE $\hat\beta^{MLE} = (\hat\beta_1^{MLE}, \dots, \hat\beta_d^{MLE})$ is obtained by differentiating the log of
this expression w.r.t. $\beta_1, \dots, \beta_d$ and setting the derivatives equal to zero.

Distribution theory GLMs
We have the following result.
■ Theorem Under some regularity conditions (see Fahrmeir and Kaufmann (1985)) we have
◆ $\hat\beta^{MLE}$ is consistent;
◆ $S_n(\hat\beta^{MLE} - \beta)$ has approximately a $N(0, I_{d \times d})$ distribution. Here $S_n$ is the
square root of
\[
X_n^T \hat W_n X_n,
\]
where $\hat W_n$ is an $n \times n$ diagonal matrix with
\[
w_{ii} = \frac{h'\big(\hat\beta_1^{MLE} x_{i1} + \dots + \hat\beta_d^{MLE} x_{id}\big)^2}{\hat\sigma_i^2},
\]
with $\hat\sigma_i^2$ an estimator of the variance of $Y_i$; see Exercise sheet 2.
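To make the matrix $X_n^T \hat W_n X_n$ concrete, here is a minimal sketch that fits a probit model by Fisher scoring and then evaluates the weights at the fitted coefficients. Several things are assumptions of this sketch rather than statements from the slides: Fisher scoring is used as the numerical method, the weights are taken as $h'(\cdot)^2 / \hat\sigma_i^2$ with $\hat\sigma_i^2 = \hat\mu_i (1 - \hat\mu_i)$, the model has $d = 2$ regressors (intercept and slope), and the data are made up.

```python
import math
from statistics import NormalDist

_N = NormalDist()  # standard normal: cdf is h = Phi, pdf is h' = phi


def information_and_score(beta, X, y):
    """Return (X^T W X, score vector) for a probit model with d = 2."""
    info = [[0.0, 0.0], [0.0, 0.0]]
    score = [0.0, 0.0]
    for x_row, y_i in zip(X, y):
        eta = beta[0] * x_row[0] + beta[1] * x_row[1]
        mu = _N.cdf(eta)             # fitted mean h(eta)
        h_prime = _N.pdf(eta)        # derivative of the response function
        variance = mu * (1.0 - mu)   # Bernoulli variance at the fitted mean
        w_ii = h_prime ** 2 / variance
        for a in range(2):
            score[a] += x_row[a] * h_prime * (y_i - mu) / variance
            for c in range(2):
                info[a][c] += x_row[a] * w_ii * x_row[c]
    return info, score


def inverse_2x2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]


def fisher_scoring(X, y, iterations=50):
    """Iterate beta <- beta + (X^T W X)^{-1} score, starting from zero."""
    beta = [0.0, 0.0]
    for _ in range(iterations):
        info, score = information_and_score(beta, X, y)
        inv = inverse_2x2(info)
        beta = [beta[a] + inv[a][0] * score[0] + inv[a][1] * score[1]
                for a in range(2)]
    return beta


# Hypothetical data: intercept column plus one regressor, responses not separable.
X = [[1.0, x] for x in (-2, -1, -1, 0, 0, 0, 1, 1, 2, 2)]
y = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

beta_hat = fisher_scoring(X, y)
info_hat, score_hat = information_and_score(beta_hat, X, y)
cov_hat = inverse_2x2(info_hat)
std_errors = [math.sqrt(cov_hat[a][a]) for a in range(2)]
print(beta_hat, std_errors)
```

At the fitted coefficients the score is numerically zero, and the diagonal of the inverse of $X_n^T \hat W_n X_n$ gives approximate variances of the estimators.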

Distribution theory GLMs
It is very instructive to compare the result on the previous slide with the results we
discussed in Lecture 1.
■ For the normal distribution as said above we have h(x) = x so that h′ (x) = 1.
Then the matrix Ŵn becomes the identity matrix multiplied by 1 divided by an
estimator for σ 2 . This is exactly what we had in Lecture 1.
■ The only difference between the result on the previous slide and the results of
Lecture 1 is the appearance of Ŵn which has its roots in the use of a link function.
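Here, as in Lecture 1, the distribution on the previous slide provides a way to test, for a given $\ell$,
\[
H_0 : \beta_\ell = 0 \quad \text{against} \quad H_1 : \beta_\ell \neq 0,
\]
based on the t-statistic
\[
T_\ell = \frac{\hat\beta_\ell - 0}{\sqrt{\big((X_n^T \hat W_n X_n)^{-1}\big)_{\ell\ell}}},
\]
where $\big((X_n^T \hat W_n X_n)^{-1}\big)_{\ell\ell}$ is the $(\ell, \ell)$ element of the matrix $(X_n^T \hat W_n X_n)^{-1}$.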
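For the normal case with $h(x) = x$, the statistic $T_\ell$ should reduce to the classical t-statistic of linear regression. The sketch below checks this on made-up data with $d = 2$; the variance estimator $\hat\sigma^2 = \mathrm{RSS}/(n - d)$ is an assumption of this sketch, since the estimator from Lecture 1 is not restated here, and all function names are hypothetical.

```python
import math


def gram(X, weights):
    """Return X^T W X for a diagonal W given as a list of weights (d = 2)."""
    G = [[0.0, 0.0], [0.0, 0.0]]
    for x_row, w in zip(X, weights):
        for a in range(2):
            for c in range(2):
                G[a][c] += x_row[a] * w * x_row[c]
    return G


def inverse_2x2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]


def wald_statistics(X, beta_hat, weights):
    """T_l = beta_hat_l / sqrt(((X^T W X)^{-1})_{ll}) for l = 1, 2."""
    cov = inverse_2x2(gram(X, weights))
    return [beta_hat[l] / math.sqrt(cov[l][l]) for l in range(2)]


# Hypothetical data for a normal linear model with intercept and slope.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
n, d = len(X), 2

# Least squares estimate (X^T X)^{-1} X^T y.
xtx_inv = inverse_2x2(gram(X, [1.0] * n))
xty = [sum(row[a] * y_i for row, y_i in zip(X, y)) for a in range(2)]
beta_hat = [sum(xtx_inv[a][c] * xty[c] for c in range(2)) for a in range(2)]

residuals = [y_i - (beta_hat[0] * row[0] + beta_hat[1] * row[1])
             for row, y_i in zip(X, y)]
sigma2_hat = sum(r * r for r in residuals) / (n - d)  # assumed estimator

# h'(x) = 1, so every weight equals 1 / sigma2_hat.
t_from_weights = wald_statistics(X, beta_hat, [1.0 / sigma2_hat] * n)
t_classical = [beta_hat[l] / math.sqrt(sigma2_hat * xtx_inv[l][l])
               for l in range(2)]
print(t_from_weights, t_classical)
```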


