
Selecting the Best Model

Categorical Data Analysis

Session VII
Hierarchical Log-Linear Models
• The construction of log-linear models is restricted by the hierarchy principle: if an interaction term is used as an explanatory variable, then every main effect that makes up that interaction must also be included in the model.

• Example: if the model contains $\lambda_{ij}^{XY}$, then it must also contain $\lambda_i^X + \lambda_j^Y$.

• The log-linear model for a three-dimensional table,

  $\log m_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ijk}^{XYZ}$,

  is the saturated (full) model. Double subscripts denote partial (two-factor) associations, while the triple subscript denotes the three-factor interaction.
Some Log-Linear Models for Three-Dimensional Tables*

Log-linear model                                                                                                                                  Symbol
$\log m_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z$                                                                                    (X, Y, Z)
$\log m_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY}$                                                                (XY, Z)
$\log m_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ}$                                            (XY, YZ)
$\log m_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ} + \lambda_{ik}^{XZ}$                        (XY, YZ, XZ)
$\log m_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ} + \lambda_{ik}^{XZ} + \lambda_{ijk}^{XYZ}$  (XYZ)

* The log-linear models in the table above are all hierarchical models.

The Data
In the first group of 71 subjects with abnormal electrocardiograms, 47 of the 57 overweight subjects were smokers and 10 were non-smokers. Among the 14 subjects of normal weight, 8 were smokers and 6 were non-smokers.

In the second group of 105 subjects with normal electrocardiograms, 25 of the 40 overweight subjects were smokers and 15 were non-smokers. Among the 65 of normal weight, 35 were smokers and 30 were non-smokers.

The investigators wish to assess the contribution that being overweight and smoking make to coronary artery disease.
Hypothesis Testing for a Three-Dimensional Table
ECG BMI SMOKE COUNT
1   1   1     47
1   1   2     10
1   2   1      8
1   2   2      6
2   1   1     25
2   1   2     15
2   2   1     35
2   2   2     30

• ECG = electrocardiogram (1 = abnormal, 2 = normal)
• BMI = body mass index (1 = overweight, 2 = normal weight)
• SMOKE = smoking status (1 = smoker, 2 = non-smoker)
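
For readers who want to reproduce the analysis outside SPSS, the table can be set up in long (weighted-count) format; a minimal sketch using pandas (the variable name data is ours, not part of the SPSS steps):

import pandas as pd

# 2 x 2 x 2 table in long format: one row per cell, COUNT holds the cell frequency
data = pd.DataFrame({
    "ECG":   [1, 1, 1, 1, 2, 2, 2, 2],   # 1 = abnormal, 2 = normal
    "BMI":   [1, 1, 2, 2, 1, 1, 2, 2],   # 1 = overweight, 2 = normal weight
    "SMOKE": [1, 2, 1, 2, 1, 2, 1, 2],   # 1 = smoker, 2 = non-smoker
    "COUNT": [47, 10, 8, 6, 25, 15, 35, 30],
})

print(data["COUNT"].sum())  # 176 subjects in total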
The Data - Coding
ECG    1 = Abnormal (electrocardiogram)
       2 = Normal

BMI    1 = Overweight (body mass index)
       2 = Normal weight

Smoke  1 = Smoker
       2 = Non-smoker
Initial Analysis
We first perform a simple cross-tabulation to check whether the frequencies in each cell are adequate to allow log-linear analysis.

Since only summary data are available, use

Data > Weight Cases
Weight cases by frequency variable: COUNT

Analyze > Descriptive Statistics > Crosstabs

Select ECG for the rows, BMI for the columns and SMOKE for the layer; finally, under Statistics, select Chi-square.
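
The same stratified crosstab and chi-square tests can be reproduced outside SPSS; a sketch with pandas and scipy, reusing the data frame from above (the results should match the SPSS output up to rounding):

from scipy.stats import chi2_contingency

# ECG x BMI table within each smoking stratum, weighted by COUNT
for smoke, grp in data.groupby("SMOKE"):
    table = grp.pivot(index="ECG", columns="BMI", values="COUNT")
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    print(f"SMOKE={smoke}: Pearson chi-square = {chi2:.3f}, df = {dof}, p = {p:.3f}")
    print(expected)  # expected counts under independence in this stratum

# Roughly 23.503 (p < .001) for smokers and 4.151 (p = .042) for non-smokers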
Initial Analysis
ECG * BMI * SMOKE Crosstabulation (raw data and expected values)

SMOKE = 1                          BMI = 1   BMI = 2   Total
  ECG = 1   Count                     47         8       55
            Expected Count          34.4      20.6     55.0
  ECG = 2   Count                     25        35       60
            Expected Count          37.6      22.4     60.0
  Total     Count                     72        43      115
            Expected Count          72.0      43.0    115.0

SMOKE = 2                          BMI = 1   BMI = 2   Total
  ECG = 1   Count                     10         6       16
            Expected Count           6.6       9.4     16.0
  ECG = 2   Count                     15        30       45
            Expected Count          18.4      26.6     45.0
  Total     Count                     25        36       61
            Expected Count          25.0      36.0     61.0

Total                              BMI = 1   BMI = 2   Total
  ECG = 1   Count                     57        14       71
            Expected Count          39.1      31.9     71.0
  ECG = 2   Count                     40        65      105
            Expected Count          57.9      47.1    105.0
  Total     Count                     97        79      176
            Expected Count          97.0      79.0    176.0
Initial Analysis
Chi-Square Tests

SMOKE                                   Value     df   Asymp. Sig. (2-sided)
1      Pearson Chi-Square              23.503a     1   .000
       Continuity Correction b         21.670      1   .000
       Likelihood Ratio                24.906      1   .000
       Fisher's Exact Test
       Linear-by-Linear Association    23.298      1   .000
       N of Valid Cases                   115
2      Pearson Chi-Square               4.151c     1   .042
       Continuity Correction b          3.033      1   .082
       Likelihood Ratio                 4.113      1   .043
       Fisher's Exact Test
       Linear-by-Linear Association     4.083      1   .043
       N of Valid Cases                    61
Total  Pearson Chi-Square              30.472d     1   .000
       Continuity Correction b         28.791      1   .000
       Likelihood Ratio                32.094      1   .000
       Fisher's Exact Test
       Linear-by-Linear Association    30.299      1   .000
       N of Valid Cases                   176

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 20.57.
b. Computed only for a 2x2 table.
c. 0 cells (.0%) have expected count less than 5. The minimum expected count is 6.56.
d. 0 cells (.0%) have expected count less than 5. The minimum expected count is 31.87.
Initial Analysis
(Chi-Square Tests table as above.)

From the results we infer that among both smokers and non-smokers there is an association between being overweight and an abnormal electrocardiogram.

What is the extent of the interaction between an abnormal electrocardiogram, smoking and being overweight?

Full Analysis
This question is better answered by log-linear analysis, as shown below:

Analyze > Loglinear > Model Selection

Select BMI, ECG and SMOKE as the factors, and do not forget to define the range [1, 2] for each. Then proceed.
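
Outside SPSS, the same hierarchical log-linear models can be fitted as Poisson regressions on the cell counts, which yields the same likelihood-ratio statistics; a minimal sketch with statsmodels, reusing the data frame defined earlier:

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Saturated model: generating class ECG*BMI*SMOKE (all main effects and interactions)
saturated = smf.glm("COUNT ~ C(ECG) * C(BMI) * C(SMOKE)",
                    data=data, family=sm.families.Poisson()).fit()

# Model without the three-way term: generating class (ECG*BMI, ECG*SMOKE, BMI*SMOKE)
two_way = smf.glm("COUNT ~ (C(ECG) + C(BMI) + C(SMOKE)) ** 2",
                  data=data, family=sm.families.Poisson()).fit()

# Change in deviance when the three-way interaction is dropped (about 1.389 here)
print(two_way.deviance - saturated.deviance)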
Hierarchical Loglinear Analysis - Design 1
Cell Counts and Residuals a

SMOKE  BMI  ECG    Observed Count      %    Expected Count      %    Residuals   Std. Residuals
1      1    1            47.500    27.0%          47.500    27.0%        .000            .000
            2            25.500    14.5%          25.500    14.5%        .000            .000
       2    1             8.500     4.8%           8.500     4.8%        .000            .000
            2            35.500    20.2%          35.500    20.2%        .000            .000
2      1    1            10.500     6.0%          10.500     6.0%        .000            .000
            2            15.500     8.8%          15.500     8.8%        .000            .000
       2    1             6.500     3.7%           6.500     3.7%        .000            .000
            2            30.500    17.3%          30.500    17.3%        .000            .000

a. For saturated models, .500 has been added to all observed cells.

The output commences with information about the number of cases, the factors and their levels.

A hierarchical model is being fitted. In a hierarchical model it is sufficient to list the highest-order terms; this list is called the "generating class" of the model.
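
The generating-class convention also carries over to R-style model formulas, where writing the highest-order term automatically includes every lower-order term it contains; a small illustration with patsy (purely illustrative, not part of the SPSS output):

from patsy import dmatrix

# "C(ECG) * C(BMI) * C(SMOKE)" expands hierarchically: all main effects,
# all two-way interactions, and the three-way interaction are included
design = dmatrix("C(ECG) * C(BMI) * C(SMOKE)", data)
print(design.design_info.term_names)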
K-Way and Higher-Order Effects

                                     K   df   Likelihood Ratio Chi-Square   Sig.   Pearson Chi-Square
K-way and Higher-Order Effects       1    7                        69.822   .000               68.727
                                     2    4                        44.530   .000               46.724
                                     3    1                         1.389   .239                1.421
K-way Effects                        1    3                        25.292   .000               22.004
                                     2    3                        43.142   .000               45.303
                                     3    1                         1.389   .239                1.421

The likelihood ratio chi-square with no parameters and only the mean is 69.822. The value once the first-order effects are included is 44.530. The difference, 69.822 − 44.530 = 25.292, is displayed on the first line of the K-way Effects block.

This difference measures how much the model improves when the first-order effects are added. The very small P value (.000) means that the hypothesis that the first-order effects are zero is rejected; in other words, there is a first-order effect.
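
In symbols, the K-way Effects block is obtained by differencing the K-way and Higher-Order Effects block (SPSS differences the unrounded values, hence the small discrepancy in the second line):

$G^2_{\text{1-way}} = 69.822 - 44.530 = 25.292$, with $df = 7 - 4 = 3$
$G^2_{\text{2-way}} = 44.530 - 1.389 \approx 43.142$, with $df = 4 - 1 = 3$
$G^2_{\text{3-way}} = 1.389$, with $df = 1$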
K-Way and Higher-Order Effects
(K-Way and Higher-Order Effects table as above.)

Similar reasoning is applied to the second-order effects. Adding the second-order effects improves the likelihood ratio chi-square by 43.142, which is also significant. However, adding the third-order term does not help: its P value (.239) is not significant.
K-Way and Higher-Order Effects
(K-Way and Higher-Order Effects table as above.)

In log-linear analysis the change in the value of the likelihood ratio chi-square statistic when terms are removed from (or added to) the model is an indicator of their contribution. We saw the same idea in multiple linear regression with regard to R².

The difference is that in linear regression large values of R² are associated with good models, whereas the opposite holds in log-linear analysis: small values of the likelihood ratio chi-square indicate a good model.
Backward Elimination Statistics
Step Summary

Step  Effects                                              Chi-Square   df
0     Generating Class    SMOKE*BMI*ECG                          .000    0
      Deleted Effect   1  SMOKE*BMI*ECG                         1.389    1
1     Generating Class    SMOKE*BMI, SMOKE*ECG, BMI*ECG         1.389    1
      Deleted Effect   1  SMOKE*BMI                             3.080    1
                       2  SMOKE*ECG                             3.505    1
                       3  BMI*ECG                              27.631    1
2     Generating Class    SMOKE*ECG, BMI*ECG                    4.469    2
      Deleted Effect   1  SMOKE*ECG                             7.968    1
                       2  BMI*ECG                              32.094    1
3     Generating Class    SMOKE*ECG, BMI*ECG                    4.469    2

The purpose here is to find the unsaturated model that provides the best fit to the data. This is done by checking that the model currently being tested does not give a worse fit than its predecessor.
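
The backward pass can be mimicked with the Poisson-GLM formulation from earlier: delete each two-way term in turn from the step-1 model and record the increase in deviance. A sketch (the formulas and variable names are ours; the changes should come out near 3.080, 3.505 and 27.631):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Step-1 model: all two-way interactions (generating class SMOKE*BMI, SMOKE*ECG, BMI*ECG)
base = smf.glm("COUNT ~ (C(ECG) + C(BMI) + C(SMOKE)) ** 2",
               data=data, family=sm.families.Poisson()).fit()

# Drop each two-way term in turn and measure the increase in deviance
candidates = {
    "SMOKE*BMI": "COUNT ~ C(ECG) + C(BMI) + C(SMOKE) + C(ECG):C(BMI) + C(ECG):C(SMOKE)",
    "SMOKE*ECG": "COUNT ~ C(ECG) + C(BMI) + C(SMOKE) + C(ECG):C(BMI) + C(BMI):C(SMOKE)",
    "BMI*ECG":   "COUNT ~ C(ECG) + C(BMI) + C(SMOKE) + C(ECG):C(SMOKE) + C(BMI):C(SMOKE)",
}
for name, formula in candidates.items():
    reduced = smf.glm(formula, data=data, family=sm.families.Poisson()).fit()
    print(name, round(reduced.deviance - base.deviance, 3))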
Backward Elimination Statistics
(Step Summary table as above.)

As a first step the procedure commences with the most complex model, in our case BMI * ECG * SMOKE. Eliminating the three-way term produces a chi-square change of 1.389, with an associated significance level of 0.2386. Since this is greater than the criterion level of 0.05, the term is removed.
Backward Elimination Statistics
(Step Summary table as above.)

The procedure then moves on to the next hierarchical level, described under Step 1, where all two-way interactions between the three variables are tested.

Removing BMI * ECG would produce a large change of 27.631 in the likelihood ratio chi-square; the corresponding P value is highly significant (p < 0.0005).
Backward Elimination Statistics
(Step Summary table as above.)

The smallest change (3.080) belongs to the BMI * SMOKE interaction, so it is removed next. The procedure continues until the final model is reached, which contains the two-way interactions BMI * ECG and ECG * SMOKE.

Each time an estimate is obtained it is called an iteration. The largest difference between successive estimates is called the convergence criterion.
Backward Elimination Statistics
(Step Summary table as above.)

We conclude that being overweight and smoking each have a significant association with an abnormal electrocardiogram. However, in this particular group of subjects, being overweight is the more harmful of the two.
Odds Ratio
We could have inferred this by calculating the odds ratios when we performed the cross-tabulation.

The odds ratio is a measure of effect size, describing the strength of association or non-independence between two binary data values. It is used as a descriptive statistic, and plays an important role in logistic regression. Unlike other measures of association for paired binary data, such as the relative risk, the odds ratio treats the two variables being compared symmetrically, and can be estimated using some types of non-random samples.
Odds Ratio
If we observe data in the form of a contingency table

            Y = 1    Y = 0
  X = 1     n_11     n_10
  X = 0     n_01     n_00

then the probabilities in the joint distribution can be estimated as

            Y = 1    Y = 0
  X = 1     p̂_11     p̂_10
  X = 0     p̂_01     p̂_00

where $\hat{p}_{ij} = n_{ij}/n$, with $n = n_{11} + n_{10} + n_{01} + n_{00}$ being the sum of all four cell counts.
Odds Ratio
The sample odds ratio and sample log odds ratio are

$\widehat{OR} = \dfrac{\hat{p}_{11}\,\hat{p}_{00}}{\hat{p}_{10}\,\hat{p}_{01}} = \dfrac{n_{11}\,n_{00}}{n_{10}\,n_{01}}, \qquad \log \widehat{OR} = \log\!\left(\dfrac{n_{11}\,n_{00}}{n_{10}\,n_{01}}\right).$

The distribution of the log odds ratio is approximately normal, with the population log odds ratio as its mean.

The standard error of the log odds ratio is approximately

$SE\!\left(\log \widehat{OR}\right) \approx \sqrt{\dfrac{1}{n_{11}} + \dfrac{1}{n_{10}} + \dfrac{1}{n_{01}} + \dfrac{1}{n_{00}}}.$
Odds Ratio
The odds ratio calculations are shown below:

                         Cardiogram abnormal   Cardiogram normal
                         (ECG 1)               (ECG 2)
Overweight (BMI 1)              47                    25
Normal weight (BMI 2)            8                    35

Odds ratio = 8.225, ln(odds ratio) = 2.11

                         Cardiogram abnormal   Cardiogram normal
                         (ECG 1)               (ECG 2)
Smoker (Smoking 1)              10                    15
Non-smoker (Smoking 2)           6                    30

Odds ratio = 3.33, ln(odds ratio) = 1.2
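
A quick sketch of the arithmetic, including the approximate standard error of the log odds ratio (the helper function is ours, purely for illustration):

import math

def odds_ratio(n11, n10, n01, n00):
    """Odds ratio, its log, and the approximate SE of the log odds ratio."""
    or_ = (n11 * n00) / (n10 * n01)
    return or_, math.log(or_), math.sqrt(1/n11 + 1/n10 + 1/n01 + 1/n00)

print(odds_ratio(47, 25, 8, 35))   # ~ (8.225, 2.11, ...)
print(odds_ratio(10, 15, 6, 30))   # ~ (3.33, 1.20, ...)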


Comments
To perform a multi-way frequency analysis, tables are formed that contain the one-way, two-way, three-way, and higher-order associations. The log-linear model starts with all of the one-, two-, three-, and higher-way associations and then eliminates as many of them as possible while still maintaining an adequate fit between expected and observed cell frequencies.

In log-linear modelling the full model, which includes all possible main effects and interactions, fits the data exactly, with zero residual deviance. One then assesses whether a simpler model fits the data adequately by comparing its residual deviance with that of the full model.
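
As a concrete check on this idea, the residual deviance of the model that survives the backward elimination (generating class SMOKE*ECG, BMI*ECG) can be compared to a chi-square reference distribution; a sketch reusing the statsmodels setup from earlier (the deviance should be close to the 4.469 on 2 df reported by SPSS):

from scipy.stats import chi2
import statsmodels.api as sm
import statsmodels.formula.api as smf

final = smf.glm("COUNT ~ C(ECG) + C(BMI) + C(SMOKE) + C(ECG):C(SMOKE) + C(ECG):C(BMI)",
                data=data, family=sm.families.Poisson()).fit()

# A non-small p-value means the reduced model reproduces the observed counts adequately
p_value = chi2.sf(final.deviance, df=final.df_resid)
print(final.deviance, final.df_resid, p_value)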
Comments

In our example, the three-way association tested was between category of electrocardiogram, body mass index, and smoking. It was eliminated because it was found not to be significant. After that, the two-way associations (type of electrocardiogram and body mass index; type of electrocardiogram and smoking) were tested, and both were found significant.
Comments

As we have seen, the purpose of multi-way frequency analysis is to test for association among discrete variables. Once a preliminary search for association has been completed using simple 2 x 2 contingency tables, a model is fitted that includes only the associations necessary to reproduce the observed frequencies.
Comments
In the above example we have a data set with a binary response variable (electrocardiogram abnormal/normal) and explanatory variables that are all categorical. In such a situation one has a choice between using logistic regression and log-linear modelling. To perform logistic regression, the data must be rearranged so that for each variable we have a column of 1's and 0's, as sketched below.
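
A minimal sketch of that rearrangement, reusing the long-format data frame from earlier: each cell is expanded into COUNT individual rows and the 1/2 codes are turned into 0/1 indicators.

# Expand the weighted table into one row per subject
subjects = data.loc[data.index.repeat(data["COUNT"])].drop(columns="COUNT")

# Recode the 1/2 categories as 1/0 indicators (1 = abnormal ECG, overweight, smoker)
subjects = (subjects == 1).astype(int)

print(len(subjects))   # 176 rows, one per subject
print(subjects.head())

From there, something like smf.logit("ECG ~ BMI + SMOKE", data=subjects).fit() would fit the corresponding logistic regression.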
Comments

Other differences from logistic regression are:

1. There is no clear demarcation between outcome and explanatory variables in log-linear models.

2. Logistic regression allows continuous as well as categorical explanatory variables to be included in the regression analysis.
