Statistics and Probability
A Contingency Table
Suppose X and Y are two categorical variables, where X has I categories and Y has J categories. We display the I × J possible combinations of outcomes in a rectangular table having I rows for the categories of X and J columns for the categories of Y. A table in which the cells contain frequency counts of outcomes is called a contingency table.
We define π_{ij} = P(X = i, Y = j) as the probability that (X, Y) falls in the cell in row i and column j. The probabilities {π_{ij}} form the joint distribution of X and Y. Note that

Σ_{i=1}^{I} Σ_{j=1}^{J} π_{ij} = 1.
Marginal Distributions
The marginal distribution of X is {π_{i+}}, obtained by the row sums:

π_{i+} = Σ_{j=1}^{J} π_{ij},

which represents the ith row total. The marginal distribution of Y is

π_{+j} = Σ_{i=1}^{I} π_{ij},

which represents the jth column total. For example, in a 2 × 2 table we have

π_{1+} = π_{11} + π_{12},  π_{+1} = π_{11} + π_{21}.
Suppose we have sample data displayed in a contingency table, such as the cross-classification of belief in afterlife below. The observed values in the table are represented by n_{ij}; that is, the cell counts are denoted by n_{ij}, with

n = Σ_{i=1}^{I} Σ_{j=1}^{J} n_{ij}.

Remember that I and J are the total number of rows and columns.
Belief in Afterlife

Gender    Yes             No or Undecided   Total
Female    n_{11} = 435    n_{12} = 147      n_{1+} = 582
Male      n_{21} = 375    n_{22} = 134      n_{2+} = 509
Total     n_{+1} = 810    n_{+2} = 281      n = 1091
Example: Sample Proportions

Belief in Afterlife

Gender    Yes                         No or Undecided   Total
Female    p_{11} = 435/1091 = 0.398   p_{12} = 0.135    p_{1+} = 0.533
Male      p_{21} = 0.344              p_{22} = 0.123    p_{2+} = 0.467
Total     p_{+1} = 0.742              p_{+2} = 0.258    1.000
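These joint and marginal proportions can be reproduced in R (the language of the code at the end of these notes); a minimal sketch, with the table entered by hand:

# Belief-in-afterlife counts (rows: gender; columns: belief)
afterlife <- matrix(c(435, 147, 375, 134), nrow = 2, byrow = TRUE,
                    dimnames = list(Gender = c("Female", "Male"),
                                    Belief = c("Yes", "No or Undecided")))
prop.table(afterlife)               # joint proportions p_ij = n_ij / n
addmargins(prop.table(afterlife))   # the same table with marginal proportions added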
Conditional Probabilities
In the table for belief in afterlife, the response variable is belief in afterlife (usually represented by the column variable) and gender is an explanatory variable (usually represented by the row variable). Now we wish to study the conditional distributions of belief in the afterlife, given gender.
For females,
Proportion of yes responses =435/582= 0.747
Proportion of no responses = 147/582=0.253
For males,
Proportion of yes responses = 0.737
Proportion of no responses = 0.263
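These conditional proportions are row proportions, which R computes by normalizing each row of the table; a short sketch reusing the afterlife matrix defined above:

# Conditional distributions of belief given gender (divide each row by its total)
prop.table(afterlife, margin = 1)   # Female: 0.747, 0.253; Male: 0.737, 0.263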
Independence
In general, two variables are statistically independent if all joint probabilities equal
the product of their marginal probabilities, i.e.
π_{ij} = π_{i+} π_{+j}, for i = 1, …, I and j = 1, …, J.

Note that independence means that the conditional distributions of Y are identical at each level of X.
Difference of Proportions
Suppose that the row totals are fixed (look at the table for the cross-classification of aspirin use and myocardial infarction below!) and hence we have a binomial model. Now assume that the two categories of Y are success and failure. Let π_1 be the probability of success in row 1 and π_2 be the probability of success in row 2. The difference π_1 − π_2 compares the success probabilities in the two rows. This measure is used to quantify the strength of association between the variables.
If the counts in the two rows are independent samples, the estimated standard error of p_1 − p_2 is given by

σ̂(p_1 − p_2) = √[ p_1(1 − p_1)/n_{1+} + p_2(1 − p_2)/n_{2+} ].
Example: For the belief in afterlife data,
i) estimate the difference of proportions,
ii) estimate its standard error, and
iii) compute the 95% confidence interval for the true difference of proportions.
We let p_1 and p_2 denote the sample proportions of yes responses for females and males. The sample proportions are p_1 = 0.747 and p_2 = 0.737, so p_1 − p_2 = 0.010 and

σ̂(p_1 − p_2) = √( 0.747 × 0.253/582 + 0.737 × 0.263/509 ) = 0.02656.
Confidence Interval

(p_1 − p_2) ± z_{α/2} σ̂(p_1 − p_2),

where z_{α/2} denotes the standard normal percentile having right-tail probability equal to α/2.
Inference using the confidence interval for the true difference in proportions
Always check whether the constructed interval includes the value 0 (zero): if it does, there is no evidence of a difference between the proportions, i.e. the data are consistent with the two variables being independent (not associated).
Example: Consider the aspirin use and myocardial infarction (MI) data shown in the table further below, and again estimate the difference of proportions, its standard error, and the 95% confidence interval.
Solution
First, the sample proportions of MI cases are p_1 = 189/11034 = 0.0171 for the placebo group and p_2 = 104/11037 = 0.0094 for the aspirin group. Next, the difference of proportions is p_1 − p_2 = 0.0077, and the estimate of its standard error is
σ̂(p_1 − p_2) = √( 0.0171 × 0.9829/11034 + 0.0094 × 0.9906/11037 ) = 0.0015.
Therefore, a 95% confidence interval (CI) for the true π_1 − π_2 is given by

(p_1 − p_2) ± z_{α/2} σ̂(p_1 − p_2) = 0.0077 ± 1.96 × 0.0015 = (0.005, 0.011).
Interpretation
The 95% confidence interval for the true difference of proportions lies between 0.005 and 0.011. Since the interval does not contain zero, the proportion of MI cases appears genuinely higher for the placebo group.
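The same computation is easy to carry out in R; a minimal sketch, with the counts taken from the aspirin table shown later:

n1 <- 11034; n2 <- 11037            # group sizes (placebo, aspirin)
p1 <- 189 / n1                      # sample proportion of MI cases, placebo
p2 <- 104 / n2                      # sample proportion of MI cases, aspirin
se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
(p1 - p2) + c(-1, 1) * qnorm(0.975) * se   # 95% CI, approximately (0.005, 0.011)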
Remark
A difference between two proportions of a certain fixed size may have greater importance when both proportions are near 0 or 1 than when they are near the middle of the range. For example, the difference between 0.010 and 0.001 is the same as the difference between 0.410 and 0.401, namely 0.009, but the first difference seems more important than the latter, since ten times as many subjects have adverse reactions with one drug as with the other.
Relative Risk
The relative risk is another measure used to quantify the strength of association between the variables.
In 2 × 2 tables, the relative risk is the ratio of the success probabilities for the two groups, π_1/π_2.
The proportions 0.010 and 0.001 have a relative risk of 10.0, whereas the proportions 0.410 and 0.401 have a relative risk of 1.02.
Note that a relative risk of 1.0 occurs when π_1 = π_2, which indicates that the response is independent of the group.
Example: For the aspirin and MI data, the sample relative risk is p_1/p_2 = 0.0171/0.0094 = 1.818. This means that the sample proportion of MI cases was 82% higher for the group taking placebo.
Inference
A large-sample 95% confidence interval for the true relative risk π_1/π_2 is [1.4330, 2.3059].
Interpretation: We are 95% confident that the proportion of MI cases for those taking placebo is between 1.43 and 2.31 times the proportion for those taking aspirin.
Note
This confidence interval for the relative risk indicates that the risk of MI is at
least 43% higher for the placebo group.
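The notes do not show how this interval was obtained; a sketch of the usual large-sample approach, which exponentiates a normal interval for log(p_1/p_2) using the standard delta-method standard error (reusing n1, n2, p1, p2 from the sketch above):

rr <- p1 / p2                                        # sample relative risk, about 1.82
se_log_rr <- sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))
exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log_rr)   # approximately (1.43, 2.31)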
The C.I. (0.005, 0.011) for the difference of proportions, π_1 − π_2, makes it seem as if the two groups differ by a trivial amount, but the relative risk shows that the difference may have important public health implications. Hence, using the difference of proportions alone to compare two groups can be somewhat misleading when the proportions are BOTH close to zero.
It is sometimes informative to also compute the ratio of "failure" probabilities, (1 − π_1)/(1 − π_2). This takes a different value than the ratio of the success probabilities. When one of the two outcomes has small probability, one normally computes the ratio of the probabilities for that outcome.
Odds Ratio
The odds ratio is another measure used to quantify the strength of association between the variables. The quantity

θ = odds_1/odds_2 = [π_1/(1 − π_1)] / [π_2/(1 − π_2)]

is called the odds ratio.
Properties of Odds Ratio (OR)
In terms of the joint probabilities of a 2 × 2 table,

θ = (π_{11}/π_{12}) / (π_{21}/π_{22}) = (π_{11} π_{22}) / (π_{12} π_{21}),

which is why θ is also called the cross-product ratio.
The sample odds ratio is equal to the ratio of the sample odds in the two rows,

θ̂ = [p_1/(1 − p_1)] / [p_2/(1 − p_2)] = (n_{11}/n_{12}) / (n_{21}/n_{22}) = (n_{11} n_{22}) / (n_{12} n_{21}).
Example: Aspirin Use and Myocardial Infarction (MI)

           Myocardial Infarction
Group      Yes    No      Total
Placebo    189    10845   11034
Aspirin    104    10933   11037
Solution
For the placebo group, the odds of MI are 189/10845 = 0.0174, and for the aspirin group, the odds of MI are 104/10933 = 0.0095. Therefore, the sample odds ratio is 0.0174/0.0095 = 1.832, which is interpreted as: the estimated odds of MI are 83% higher for the placebo group.
For small to moderate sample sizes, the distribution of the sample odds ratio θ̂ is highly skewed. When θ = 1, θ̂ cannot be much smaller than θ, but it can be much larger than θ with nonnegligible probability.
Consider the log odds ratio, log θ.
X and Y being independent implies log θ = 0.
The log odds ratio is symmetric about zero, in the sense that reversal of rows or reversal of columns changes only its sign.
The sample log odds ratio log θ̂ has a less skewed distribution and is well approximated by the normal distribution.
The asymptotic standard error of log θ̂ is given by

ASE(log θ̂) = √( 1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22} ).
Therefore, a large-sample confidence interval for θ is given by

exp[ log θ̂ ± z_{α/2} ASE(log θ̂) ].
Construct a 95% confidence interval for the odds ratio, and interpret.
Solution

ASE(log θ̂) = √( 1/189 + 1/10933 + 1/10845 + 1/104 ) = 0.123.

Since log θ̂ = log(1.832) = 0.605, a 95% confidence interval for log θ is 0.605 ± 1.96 × 0.123 = (0.365, 0.846), and exponentiating the endpoints gives the interval (1.44, 2.33) for θ.
Interpretation
Since the confidence interval for θ does not contain 1.0, the true odds of MI appear to differ between the two groups. The interval predicts that the odds of MI are at least 44% higher for subjects taking placebo than for those taking aspirin.
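This interval is easy to reproduce in R from the four cell counts; a minimal sketch:

n11 <- 189; n12 <- 10845; n21 <- 104; n22 <- 10933   # aspirin and MI table
or <- (n11 * n22) / (n12 * n21)                      # sample odds ratio, 1.832
ase <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)           # ASE of log odds ratio, 0.123
exp(log(or) + c(-1, 1) * qnorm(0.975) * ase)         # 95% CI, approximately (1.44, 2.33)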
Notes
The endpoints of the interval are not equally distant from θ^ =1.83, because
the sampling distribution of θ^ is skewed to the right.
The sample odds ratio is 0 or ∞ if any n_{ij} = 0, and it is undefined if both entries in a row or column are zero. In such a case, we consider the slightly modified estimator

θ̃ = [(n_{11} + 0.5)(n_{22} + 0.5)] / [(n_{12} + 0.5)(n_{21} + 0.5)].

In the ASE formula, the n_{ij}'s are likewise replaced by n_{ij} + 0.5.
A sample odds ratio of 1.832 does not mean that p_1 is 1.832 times p_2. The following simple relation exists between the odds ratio and the relative risk:

Odds Ratio = [p_1/(1 − p_1)] / [p_2/(1 − p_2)] = Relative Risk × (1 − p_2)/(1 − p_1).
If p_1 and p_2 are close to 0, the odds ratio and relative risk take similar values. The aspirin use table illustrates this similarity: for each group, the sample proportion of MI cases is close to zero, so the sample odds ratio of 1.83 is similar to the sample relative risk of 1.82 obtained earlier. In such a case, an odds ratio of 1.83 does mean that p_1 is approximately 1.83 times p_2.
This relationship between the odds ratio and the relative risk is useful. For some datasets, calculation of the relative risk is not possible, yet one can compute the odds ratio and use it to approximate the relative risk. Such studies occur when the sampling design is retrospective: one can construct conditional distributions for the explanatory variable within each level of the fixed response, but it is not possible to estimate the probability of the response outcome of interest, or to compute the difference of proportions or relative risk for that outcome. We illustrate this in the following example.
The odds ratio is determined by the conditional distributions in either
direction, and can be computed even if we have a study design that measures
a response on X within each level of Y.
To test H_0 that the cell probabilities equal certain fixed values {π_{ij}}, we let n_{ij} be the cell frequencies and n the total sample size; the values μ_{ij} = nπ_{ij} are then the expected cell frequencies under H_0.
χ² = Σ_{i=1}^{I} Σ_{j=1}^{J} (n_{ij} − μ_{ij})² / μ_{ij}.
Properties of χ²
This statistic takes its minimum value of zero when all n_{ij} = μ_{ij}. For a fixed sample size, greater differences between {n_{ij}} and {μ_{ij}} produce larger χ² values and stronger evidence against H_0. For large sample sizes, the χ² statistic has approximately a chi-squared distribution with appropriate degrees of freedom.
Likelihood-Ratio Test
The likelihood ratio Λ is the maximum of the likelihood when the parameters satisfy H_0, divided by the maximum of the likelihood over the full parameter space. This likelihood ratio cannot exceed 1. If the maximized likelihood is much larger when the parameters are not forced to satisfy H_0, then the ratio Λ is far below 1 and there is strong evidence against H_0. Hence, a small likelihood ratio implies deviation from H_0.
The likelihood-ratio statistic is

G² = −2 log Λ = 2 Σ_{i=1}^{I} Σ_{j=1}^{J} n_{ij} log(n_{ij}/μ_{ij}).
The Pearson χ² and likelihood-ratio G² statistics have the same large-sample distribution under the null hypothesis, and they commonly yield the same conclusions.
Tests of Independence
Under H_0: independence, the expected frequencies are μ_{ij} = nπ_{i+}π_{+j}. Usually {π_{i+}} and {π_{+j}} are unknown, and hence so are these expected values. We estimate the expected frequencies by substituting sample proportions for the unknown probabilities, leading to

μ̂_{ij} = n p_{i+} p_{+j} = n (n_{i+}/n)(n_{+j}/n) = n_{i+} n_{+j}/n.
The test statistics become

χ² = Σ_{i=1}^{I} Σ_{j=1}^{J} (n_{ij} − μ̂_{ij})² / μ̂_{ij},

G² = 2 Σ_{i=1}^{I} Σ_{j=1}^{J} n_{ij} log(n_{ij}/μ̂_{ij}),

and both have large-sample chi-squared distributions with (I − 1)(J − 1) degrees of freedom.
Example: Party Identification and Gender

           Party Identification
Gender    Democrat   Independent   Republican   Total
Female    165        47            191          403
Male      279        73            225          577
Total     444        120           416          980
Use the Pearson’s Chi-square and Likelihood ratio to test whether Party
Identification and Gender are associated.
The test statistics are χ 2=7.01 and G2 = 7.00 with degrees of freedom = (I -1)(J - 1)
=(2-1)(3-1)=2.
Each statistic has P-value = 0.03. Therefore, the above test statistics suggest that
Party Identification and Gender are related.
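Both statistics can be reproduced in R; a sketch assuming the party labels given in the table above (chisq.test returns the Pearson statistic, and G² is computed directly from its observed and expected counts):

party <- matrix(c(165, 47, 191, 279, 73, 225), nrow = 2, byrow = TRUE,
                dimnames = list(Gender = c("Female", "Male"),
                                Party = c("Democrat", "Independent", "Republican")))
fit <- chisq.test(party)        # Pearson X-squared = 7.01, df = 2, p-value = 0.03
2 * sum(fit$observed * log(fit$observed / fit$expected))   # G2 = 7.00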
Residuals
A test statistic and its P-value simply describe the evidence against the null
hypothesis. In order to understand better the nature of evidence against H 0, a cell
by cell comparison of observed and estimated frequencies is necessary.
Remarks
Pearson χ² tests indicate only the degree of evidence for an association; they cannot answer other questions, such as the nature of the association.
These χ² tests are not always applicable: we need large datasets to apply them. The approximation is often poor when n/(IJ) < 5.
The values of χ² and G² do not depend on the ordering of the rows. Thus we ignore some information when the data are ordinal.
For ordinal data, it is important to look for types of associations when there is dependence. It is quite common to assume that as the level of X increases, responses on Y tend to increase, or responses on Y tend to decrease, toward higher levels of X.
The simplest and most common analysis assigns scores to categories and measures the degree of linear trend or correlation. The method used is known as the Mantel-Haenszel chi-squared test (Mantel and Haenszel 1959).
Suppose u_1 ≤ u_2 ≤ … ≤ u_I denote scores for the rows, and let v_1 ≤ v_2 ≤ … ≤ v_J denote scores for the columns, where the scores have the same ordering as the category levels. Define the correlation between X and Y as

r = [ Σ_i Σ_j u_i v_j n_{ij} − (Σ_i u_i n_{i+})(Σ_j v_j n_{+j})/n ] / √{ [ Σ_i u_i² n_{i+} − (Σ_i u_i n_{i+})²/n ] [ Σ_j v_j² n_{+j} − (Σ_j v_j n_{+j})²/n ] }.
Independence between the variables implies that the true value of this correlation equals zero. The larger the correlation is in absolute value, the further the data fall from independence in this linear dimension. The test statistic M² = (n − 1)r² has a large-sample chi-squared distribution with d.f. = 1.
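A sketch of this trend test in R, using a small hypothetical 2 × 3 table (the table from the notes' own example is not reproduced here); the cell counts and the scores u and v below are illustrative assumptions:

tab <- matrix(c(10, 20, 5, 8, 25, 12), nrow = 2, byrow = TRUE)  # hypothetical counts
u <- c(1, 2); v <- c(1, 2, 3)                                   # row and column scores
# Expand the table to one record per observation, then correlate the scores
x <- rep(rep(u, times = ncol(tab)), as.vector(tab))   # row score of each observation
y <- rep(rep(v, each = nrow(tab)), as.vector(tab))    # column score of each observation
r <- cor(x, y)
M2 <- (sum(tab) - 1) * r^2
pchisq(M2, df = 1, lower.tail = FALSE)                # P-value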
For the example, the test statistics are χ² = 12.1, d.f. = 4, P-value = 0.02 and G² = 6.2, d.f. = 4, P-value = 0.19; these two tests provide mixed signals. The cell percentages and adjusted residuals suggest that there may be a linear trend.
Remarks
The correlation r has limited use as a descriptive measure of tables.
Different choices of monotone scores usually give similar results. However, this may not happen when the data are very unbalanced, i.e. when some categories have many more observations than other categories. If we had taken (1, 2, 3, 4, 5) as the row scores in our example, then M² = 1.83 with P-value = 0.18, a much weaker conclusion. It is usually better to use one's own judgment, selecting scores that reflect the distances between categories.
Fisher's Exact Test
When the sample size is small, one can perform inference using exact distributions rather than large-sample approximations. This section illustrates such exact inference for two-way contingency tables.
Example (Fisher's tea-tasting experiment): A lady claimed to be able to tell whether milk or tea was poured into a cup first. She was given 8 cups, 4 of each type, and asked to identify the 4 cups with milk poured first. We need to test whether she can tell accurately; this is equivalent to testing H_0: θ = 1 against H_1: θ > 1. We cannot use the previously discussed tests, as we have a very small sample size.
When θ = 1, the probability of a particular value n_{11} for that count equals the hypergeometric probability

p(n_{11}) = C(n_{1+}, n_{11}) C(n_{2+}, n_{+1} − n_{11}) / C(n, n_{+1}).
The outcomes at least as favorable as the observed data are n_{11} = 3 and n_{11} = 4, given the row and column totals. Hence,

p(3) = C(4, 3) C(4, 1) / C(8, 4) = 16/70 = 0.229,
p(4) = C(4, 4) C(4, 0) / C(8, 4) = 1/70 = 0.0143,

so the exact P-value is P(n_{11} ≥ 3) = 0.229 + 0.014 = 0.243.
Three-Way Contingency Tables
In this section, the main objective is to analyze the association between two categorical variables X and Y while controlling for other covariates that can influence the association. Thus, we study the effect of X on Y by holding such covariates constant; in other words, we study the association between X and Y given the levels of a control variable Z.
Partial Tables
These are simply two-way tables between X and Y at separate levels of Z. The two-
way contingency table obtained by combining the partial tables is called the X - Y
marginal table.
Each cell count in the marginal table is a sum of counts from the same cell location
in the partial tables.
The marginal table, rather than controlling Z, ignores it and does not contain any
information about Z.
The associations in partial tables are called conditional associations, because they refer to the effect of X on Y while controlling for Z. Conditional associations in partial tables can be quite different from associations in marginal tables. In fact, it can be misleading to analyze only a marginal table of a multi-way contingency table, as the following example illustrates.
Simpson's Paradox
The death penalty data analyzed below are an example of Simpson's paradox: the result that a marginal association can have a different direction from the conditional associations is called Simpson's paradox. This result applies to quantitative as well as categorical variables.
Odds ratios
Consider 2 × 2 × K tables, where K denotes the number of levels of a control variable Z. Let {n_{ijk}} denote the observed frequencies and let {μ_{ijk}} denote their expected frequencies. Within a fixed level k of Z, the conditional odds ratio is

θ_{XY(k)} = (μ_{11k} μ_{22k}) / (μ_{12k} μ_{21k}),

while the marginal odds ratio, computed from the X–Y marginal table, is

θ_{XY} = (μ_{11+} μ_{22+}) / (μ_{12+} μ_{21+}).
Similar formulas with the μ_{ijk}'s replaced by the n_{ijk}'s provide sample estimates of θ_{XY(k)} and θ_{XY}; for example,

θ̂_{XY} = (n_{11+} n_{22+}) / (n_{12+} n_{21+}).
Using the death penalty example data, the sample conditional odds ratios for victim's race are:

White victims: θ̂_{XY(1)} = (53 × 37)/(414 × 11) = 0.43, which means that the sample odds of white defendants receiving the death penalty were 43% of the sample odds for black defendants.

Black victims: θ̂_{XY(2)} = (0 × 139)/(16 × 4) = 0.0, since the death penalty was never given to white defendants having black victims.

The sample marginal odds ratio is θ̂_{XY} = (53 × 176)/(430 × 15) = 1.45, which means the sample odds of the death penalty were 45% higher for white defendants than for black defendants.
However, we have just observed that those odds were smaller for a white defendant than for a black defendant within each level of victim's race. This reversal in the association when we control for victim's race illustrates Simpson's paradox.
We now consider the true relationship between X and Y, controlling for Z. If X and Y are independent in each partial table, then X and Y are said to be conditionally independent given Z. All conditional odds ratios between X and Y are then equal to 1. Conditional independence of X and Y, given Z, does not imply marginal independence of X and Y: even if the odds ratios between X and Y equal 1 at each level of Z, the marginal odds ratio may differ from 1.
θ_{XY(1)} = (18 × 8)/(12 × 12) = 1.0
θ_{XY(2)} = (2 × 32)/(8 × 8) = 1.0
These show that given clinic, response and treatment are conditionally
independent.
θ_{XY} = (20 × 40)/(20 × 20) = 2.0,
which means that the variables are not marginally independent. Why are the odds
of success twice as high for treatment A as treatment B when we ignore clinic?
It is misleading to study only the marginal tables, concluding that successes are
more likely with treatment A than with treatment B. Subjects within a particular
clinic are likely to be more homogeneous than the overall sample, and response is
independent of treatment in each clinic.
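These conditional and marginal odds ratios can be verified in R; a sketch assuming the partial tables are laid out with treatment in rows and response (success, failure) in columns:

clinic <- array(c(18, 12, 12, 8,     # clinic 1: successes 18, 12; failures 12, 8
                  2, 8, 8, 32),      # clinic 2: successes 2, 8; failures 8, 32
                dim = c(2, 2, 2))
apply(clinic, 3, function(t) t[1, 1] * t[2, 2] / (t[1, 2] * t[2, 1]))  # 1, 1
marg <- clinic[, , 1] + clinic[, , 2]                # X-Y marginal table
marg[1, 1] * marg[2, 2] / (marg[1, 2] * marg[2, 1])  # 2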
Homogeneous Association
Cochran-Mantel-Haenszel Test
In the kth partial table, the row totals are n_{1+k}, n_{2+k} and the column totals are n_{+1k}, n_{+2k}. Given both these totals, n_{11k} has a hypergeometric distribution, and that distribution determines all other cell counts in the kth partial table.
Under the null hypothesis of independence, the mean and variance of n_{11k} are

μ_{11k} = E(n_{11k}) = n_{1+k} n_{+1k} / n_{++k},
Var(n_{11k}) = n_{1+k} n_{2+k} n_{+1k} n_{+2k} / [ n_{++k}² (n_{++k} − 1) ].

The Cochran-Mantel-Haenszel statistic is

CMH = [ Σ_{k=1}^{K} (n_{11k} − μ_{11k}) ]² / Σ_{k=1}^{K} Var(n_{11k}),

which has a large-sample chi-squared distribution with d.f. = 1 under H_0.
CMH takes larger values when (n_{11k} − μ_{11k}) is consistently positive or consistently negative across tables, rather than positive for some and negative for others. The test is inappropriate when the association varies widely among the partial tables; it works best when the X–Y association is similar in each partial table.
Example: Smoking and Lung Cancer in Chinese Cities (fragment of the full table)

City        Group        Cancer Yes   Cancer No   θ̂_{XY(k)}   μ_{11k}   Var(n_{11k})
…           Nonsmokers   121          215
Zhengzhou   Smokers      182          156         1.59        169.0     28.3
            Nonsmokers   72           98
Taiyuan     Smokers      60           99          2.37        53.0      9.0
            Nonsmokers   11           43
Nanchang    Smokers      104          89          2.00        96.5      11.0
            Nonsmokers   21           36
In this example, Σ_k n_{11k} = 2930, Σ_k μ_{11k} = 2562.5, and Σ_k Var(n_{11k}) = 482.1, so CMH = (2930 − 2562.5)²/482.1 = 280.1 with d.f. = 1. Since 280.1 is enormous compared with a chi-squared distribution with d.f. = 1, there is extremely strong evidence against conditional independence.
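R's mantelhaen.test performs this test on a 2 × 2 × K array. A sketch using only the three cities whose rows are fully visible above; the full data set (called lungtab in the R output shown later) contains more cities, so the numbers from this sketch will not match the 280.1 reported for the complete data:

lung3 <- array(c(182, 72, 156, 98,    # Zhengzhou: smokers/nonsmokers x cancer yes/no
                 60, 11, 99, 43,      # Taiyuan
                 104, 21, 89, 36),    # Nanchang
               dim = c(2, 2, 3))
mantelhaen.test(lung3, correct = FALSE)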
Now we wish to combine the sample odds ratios from the K partial tables into a single summary measure of partial association. Assume homogeneous association, that is,

θ_{XY(1)} = θ_{XY(2)} = … = θ_{XY(K)}.
The Mantel-Haenszel estimator of the common odds ratio is

θ̂_MH = [ Σ_k n_{11k} n_{22k}/n_{++k} ] / [ Σ_k n_{12k} n_{21k}/n_{++k} ],

and the estimated variance of log θ̂_MH is

σ̂²(log θ̂_MH) = [ Σ_k (n_{11k} + n_{22k})(n_{11k} n_{22k})/n_{++k}² ] / { 2 [ Σ_k n_{11k} n_{22k}/n_{++k} ]² }
  + [ Σ_k { (n_{11k} + n_{22k})(n_{12k} n_{21k}) + (n_{12k} + n_{21k})(n_{11k} n_{22k}) } / n_{++k}² ] / { 2 [ Σ_k n_{11k} n_{22k}/n_{++k} ][ Σ_k n_{12k} n_{21k}/n_{++k} ] }
  + [ Σ_k (n_{12k} + n_{21k})(n_{12k} n_{21k})/n_{++k}² ] / { 2 [ Σ_k n_{12k} n_{21k}/n_{++k} ]² }.
Using our example, we find that the MH estimate is θ̂_MH = 2.17, with estimated standard error

σ̂(log θ̂_MH) = 0.046.

A 95% C.I. for the common log odds ratio is 0.777 ± (1.96)(0.046) = (0.686, 0.868). Therefore, a 95% C.I. for the common odds ratio is (e^0.686, e^0.868) = (1.98, 2.38).
Note that if the true odds ratios are not identical but do not vary drastically, θ̂_MH still provides a useful summary of the K conditional associations. Similarly, the CMH test is a powerful summary of the evidence against the hypothesis of conditional independence, as long as the sample associations fall in the same direction.
data: lungtab
Mantel-Haenszel X-squared = 280.1375, df = 1, p-value < 2.2e-16
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
1.984002 2.383249
sample estimates:
common odds ratio
2.174482
Breslow-Day Test
Now we test the hypothesis that the odds ratio between X and Y is the same at each level of Z (testing for homogeneous association in 2 × 2 × K tables), i.e. the null hypothesis

H_0: θ_{XY(1)} = θ_{XY(2)} = … = θ_{XY(K)}.

The Breslow-Day statistic has the Pearson form, Σ_k Σ_i Σ_j (n_{ijk} − μ̂_{ijk})²/μ̂_{ijk}, where μ̂_{ijk} is the expected cell frequency under H_0. The formula for computing μ̂_{ijk} is complicated.
Under the null hypothesis, the Breslow-Day statistic has a large-sample chi-squared distribution with K − 1 degrees of freedom. The sample size should be relatively large in each partial table, with μ̂_{ijk} ≥ 5 in at least about 80% of the cells. The statistic can be computed using standard statistical software.
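Base R has no built-in Breslow-Day test, but add-on packages provide it; for example, assuming the DescTools package is installed, the test could be applied to a 2 × 2 × K array such as lung3 from the earlier sketch:

# install.packages("DescTools")      # if not already installed
DescTools::BreslowDayTest(lung3)     # chi-squared statistic with K - 1 df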
We can also estimate the parameters, which describe the effects in a more
informative way.
Example: Challenger O-ring
For the 23 space shuttle flights that occurred before the Challenger mission disaster
in 1986, the following table shows the temperature at the time of flight and
whether at least one primary O-ring suffered thermal distress.
The Data
We wish to test whether there was any association between temperature and
thermal distress. This will be discussed later!
Clearly, the fit from linear regression has a major structural defect, that is,
probabilities fall between 0 and 1, whereas linear functions take values over the
entire real line. Hence, the linear regression predicts probabilities below 0 and
above 1 for sufficiently large or small temperature (x) values.
The logistic function seems to fit these data decently.
Example: Horseshoe Crabs and Satellites
We shall analyze some data from a study of nesting horseshoe crabs, in which each
female horseshoe crab in the study had a male crab attached to her in her nest. The
study investigated factors that affect whether the female crab had any other males,
called satellites, residing nearby.
Explanatory variables thought possibly to affect this included the female crab's
color, spine condition, weight, and carapace width. The response outcome for each
female crab is her number of satellites.
For now we consider the width alone as a predictor of the response. To obtain a
clearer picture of overall trend, we grouped the female crabs into a set of width
categories, (≤23.25, 23.25-24.25, 24.25-25.25, 25.25-26.25, 26.25-27.25,27.25-
28.25,28.25-29.25, >29.25), and computed sample mean number of satellites for
female crabs in each category.
Generalized linear models (GLMs) have three components:
1. Random component — identifies the response variable Y and its probability distribution.
2. Systematic component — specifies the explanatory variables used in a linear predictor.
3. Link — describes the functional relation between the systematic component and the expected value (mean) of the random component.
Random Component
Let Y_1, …, Y_N denote the N observations on the response variable Y. The random component specifies a probability distribution for Y_1, …, Y_N. If the potential outcomes for each observation Y_i are binary, such as "success" or "failure", or, more generally, if each Y_i is a number of "successes" out of a certain fixed number of trials, we can assume a binomial distribution for the random component.
Systematic Component
This specifies the variables that play the roles of x_j in the linear predictor

α + β_1 x_1 + … + β_k x_k.
Link
This specifies how the mean μ relates to the explanatory variables in the linear predictor. The model formula states that

g(μ) = α + β_1 x_1 + … + β_k x_k.

Log link: g(μ) = log(μ) = α + β_1 x_1 + … + β_k x_k, which models the log of the mean. The log function applies to positive quantities, such as the means of count data.
Logit link: g(μ) = log[ μ/(1 − μ) ] = α + β_1 x_1 + … + β_k x_k, which models the log of an odds. A GLM that uses the logit link is called a logit model.
Each potential probability distribution for the random component has one special function of the mean called its natural parameter.
For the normal distribution, it is the mean itself.
For the Poisson, the natural parameter is the log of the mean.
For the binomial, the natural parameter is the logit of the success probability.
The link function that uses the natural parameter as g(μ) in the GLM is called the canonical link. Though other links are possible, in practice the canonical links are most common.
Example: Snoring and Heart Disease

                      Heart Disease
Snoring               Yes   No     Proportion   Linear Fit   Logit Fit   Probit Fit
Never                 24    1355   0.017        0.017        0.021       0.020
Occasional            35    603    0.055        0.057        0.044       0.046
Nearly every night    21    192    0.099        0.096        0.093       0.095
Every night           30    224    0.118        0.116        0.132       0.131
We use scores (0, 2, 4, 5) for the snoring categories, treating the last two levels as closer together than the other adjacent pairs. See the R code at the end of these notes for fitting these models.
Linear fit using maximum likelihood: π̂(x) = 0.0172 + 0.0198x. For a nonsnorer (x = 0), the estimated proportion of subjects having heart disease is π̂(0) = 0.0172 + 0.0198(0) = 0.0172.
Note that this ML linear fit differs slightly from the ordinary least squares fit.
Logistic Regression
The relationship between π(x) and x is usually nonlinear rather than linear. The most important function describing such a nonlinear relationship has the form

log[ π(x)/(1 − π(x)) ] = α + βx,

which is called the logistic regression function. These models are often referred to as logit models, as the link in this GLM is the logit link.
The parameter β determines the rate of increase or decrease of the logistic curve. When β > 0, π(x) increases as x increases; when β < 0, π(x) decreases as x increases. The magnitude of β determines how fast the curve increases or decreases: as |β| increases, the curve has a steeper rate of change.
Probit Models
The probability of success, π(x), has the form Φ(α + βx), where Φ is the cdf of a standard normal distribution N(0, 1). The probit transform maps π(x) so that the regression curve for π(x) (or for 1 − π(x) when β < 0) has the appearance of the normal cdf with mean μ = −α/β and standard deviation σ = 1/|β|.
Using the snoring and heart disease data, the probit model with scores (0, 2, 4, 5) for the snoring levels is

probit[ π̂(x) ] = −2.061 + 0.188x.

The fitted proportions are obtained from the standard normal table. For example, at snoring level x = 0 the fitted probit equals −2.061 + 0.188(0) = −2.06, which corresponds to a probability of 0.020. Also, at snoring level x = 5 the fitted probit equals −2.061 + 0.188(5) = −1.12, which corresponds to a fitted probability of 0.131.
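These fitted values follow from applying the standard normal cdf to the fitted probits, which in R is pnorm; a minimal check:

pnorm(-2.061 + 0.188 * c(0, 2, 4, 5))   # 0.020, 0.046, 0.095, 0.131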
Poisson Regression
This GLM assumes a Poisson distribution for the random component. Like counts, Poisson variates can take any nonnegative integer value, and the Poisson distribution has a positive mean. One can model the Poisson mean using the identity link, but it is more common to model the log of the mean. A Poisson loglinear model is a GLM that assumes a Poisson distribution for Y and uses the log link.
For this model we have

μ = exp(α + βx) = e^α (e^β)^x,

which shows that a one-unit increase in x has a multiplicative impact of e^β on μ: the mean of Y at x + 1 equals the mean of Y at x multiplied by e^β.
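A sketch of fitting such a model in R; the x values and counts below are hypothetical, made up purely to illustrate the call:

x <- c(1, 2, 3, 4, 5)                 # hypothetical explanatory variable
counts <- c(2, 3, 6, 7, 12)           # hypothetical response counts
pois_fit <- glm(counts ~ x, family = poisson(link = "log"))
exp(coef(pois_fit)["x"])              # multiplicative effect of a one-unit increase in x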
For certain types of events that occur over time, space, or some other index of size, it is often relevant to model the rate at which the events occur. For example, in modeling numbers of auto thefts in 2011 for a sample of cities, we could form a rate for each city by dividing the number of thefts by the city's population size.
The model describes how the rate depends on some other explanatory variables
such as the city’s unemployment rate, median income, and percentage of residents
having completed high school.
When a response count Y has index (such as population size) equal to t, the sample rate of outcomes is Y/t and the expected value of the rate is μ/t. A loglinear model for the expected rate has the form

log(μ/t) = α + βx,  i.e.  log μ − log t = α + βx.

The adjustment term −log t in this equation is called an offset. Statistical software can fit models having offsets.
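In R, an offset is supplied with the offset() term in the model formula; a sketch with hypothetical city-level data (thefts, pop, and unemp are made-up names and values):

thefts <- c(30, 65, 110)              # hypothetical theft counts
pop <- c(50000, 120000, 210000)       # population sizes (the index t)
unemp <- c(4.1, 6.3, 8.0)             # hypothetical unemployment rates
glm(thefts ~ unemp + offset(log(pop)), family = poisson)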
MODEL INFERENCE AND MODEL CHECKING
Deviance
This is a measure that summarizes the fit of GLMs. A saturated model has a
separate parameter for each observation giving a perfect fit.
Let θ̃ denote the estimate of θ for the saturated model, corresponding to estimated means μ̃_i = y_i for all i, and let θ̂ denote the MLE of θ for the model under consideration. The deviance of the fitted model is defined as

D(y; μ̂) = −2[ L(μ̂; y) − L(y; y) ]
 = 2 Σ_{i=1}^{N} [ y_i θ̃_i − b(θ̃_i) ]/a(φ) − 2 Σ_{i=1}^{N} [ y_i θ̂_i − b(θ̂_i) ]/a(φ).
Usually a(φ) has the form a(φ) = φ/w_i, and the scaled deviance equals

2 Σ_{i=1}^{N} w_i [ y_i(θ̃_i − θ̂_i) − b(θ̃_i) + b(θ̂_i) ] / φ.
When a model with log link contains an intercept term, the deviance simplifies to

D(y; μ̂) = 2 Σ_{i=1}^{N} y_i log(y_i/μ̂_i).
Deviance Residuals
Define

d_i = 2 w_i [ y_i(θ̃_i − θ̂_i) − b(θ̃_i) + b(θ̂_i) ].

The deviance residual for observation i is

sign(y_i − μ̂_i) √d_i.
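In R, the deviance and deviance residuals of a fitted GLM are available directly; a sketch reusing the hypothetical Poisson fit from the earlier sketch:

deviance(pois_fit)                          # D(y; mu-hat)
residuals(pois_fit, type = "deviance")      # deviance residual for each observation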
Some R Codes
# Snoring and heart disease data
snoring <- c(0, 2, 4, 5)
disease <- c(24, 35, 21, 30)
total <- c(1379, 638, 213, 254)
# Logit model
glm(cbind(disease, total - disease) ~ snoring,
    family = binomial(link = "logit"))
# Probit model
glm(cbind(disease, total - disease) ~ snoring,
    family = binomial(link = "probit"))
Reference: McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. London: Chapman and Hall.