
10-708 Probabilistic Graphical Models

Homework 1 Solutions March 11, 2015

1 Directed Graphical Models (Pengtao)


1.1 Factorization to I-map
Given that P factorizes according to G, we have that:
P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i)),    (1)

where X_1, \ldots, X_n are the random variables corresponding to the nodes of graph G and Pa(X_i) denotes the
parents of node X_i in graph G. Furthermore, let ND(X_i) denote the non-descendant nodes of node X_i in
graph G and D(X_i) denote its descendant nodes. Using the above factorization, we can see that:
P(X_i \mid ND(X_i)) = \frac{P(X_i, ND(X_i))}{P(ND(X_i))}
                    = \frac{\sum_{D(X_i)} P(X_1, \ldots, X_n)}{\sum_{X_i, D(X_i)} P(X_1, \ldots, X_n)}
                    = \frac{\sum_{D(X_i)} \prod_{j=1}^{n} P(X_j \mid Pa(X_j))}{\sum_{X_i, D(X_i)} \prod_{j=1}^{n} P(X_j \mid Pa(X_j))}    (2)
                    = \frac{\prod_{X_j \in (ND(X_i) \cup X_i)} P(X_j \mid Pa(X_j)) \sum_{D(X_i)} \prod_{X_j \in D(X_i)} P(X_j \mid Pa(X_j))}{\prod_{X_j \in ND(X_i)} P(X_j \mid Pa(X_j)) \sum_{X_i, D(X_i)} \prod_{X_j \in (D(X_i) \cup X_i)} P(X_j \mid Pa(X_j))}
                    = \frac{\prod_{X_j \in (ND(X_i) \cup X_i)} P(X_j \mid Pa(X_j)) \cdot 1}{\prod_{X_j \in ND(X_i)} P(X_j \mid Pa(X_j)) \cdot 1}
                    = P(X_i \mid Pa(X_i)).


This means that:
\{X_i \perp ND(X_i) \mid Pa(X_i)\}, \quad i = 1, \ldots, n,    (3)
which by the local Markov assumptions we know to be the independence assertions included in set I(G)
(using the notation from our lecture notes). Therefore, we see that those assertions are included in I(P ),
i.e. I(G) ⊆ I(P ), and so we have proven that if P factorizes according to G, then G is an I-map for P .
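To illustrate this numerically, here is a minimal sketch (assuming an arbitrary toy chain A → B → C with random conditional probability tables) that checks the local Markov statement P(C | A, B) = P(C | B) implied by the factorization:

import itertools
import numpy as np

# Toy DAG: A -> B -> C, so the joint factorizes as P(A) P(B|A) P(C|B).
rng = np.random.default_rng(0)
p_a = rng.dirichlet(np.ones(2))              # P(A)
p_b_a = rng.dirichlet(np.ones(2), size=2)    # P(B | A), rows indexed by A
p_c_b = rng.dirichlet(np.ones(2), size=2)    # P(C | B), rows indexed by B

joint = np.zeros((2, 2, 2))                  # axes: (A, B, C)
for a, b, c in itertools.product(range(2), repeat=3):
    joint[a, b, c] = p_a[a] * p_b_a[a, b] * p_c_b[b, c]

# Local Markov check for C: its only parent is B and its only non-descendant is A,
# so P(C | A, B) should equal P(C | B).
p_ab = joint.sum(axis=2)                              # P(A, B)
p_c_given_ab = joint / p_ab[:, :, None]               # P(C | A, B)
p_bc = joint.sum(axis=0)                              # P(B, C)
p_c_given_b = p_bc / p_bc.sum(axis=1, keepdims=True)  # P(C | B)

assert np.allclose(p_c_given_ab, p_c_given_b[None, :, :])
print("C is independent of its non-descendant A given its parent B.")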

1.2 D-separation
• The joint distribution can be written as:
P(X_1, \ldots, X_7) = P(X_1) P(X_3 \mid X_1) P(X_2 \mid X_3) P(X_5 \mid X_2) P(X_6 \mid X_2) P(X_7) P(X_4 \mid X_3, X_7).    (4)

• Yes. There is only one path from X_1 to X_5, and it is a causal trail. One of the nodes along this path
is X_2. Therefore, when X_2 is observed, the path becomes blocked and X_1 is independent of X_5.
• No. X_7 and X_3 are both parents of X_4, so when X_4 is observed these two nodes become dependent.
Moreover, X_3 is clearly dependent on X_2, and so, when X_4 is observed, X_2 and X_7 become dependent.
(These d-separation statements can also be checked numerically; see the sketch after this list.)
• Yes. When X_3 is observed, the trail between X_4 and X_2 is blocked. That trail is part of the only path
between X_4 and X_5, and so, when X_3 is observed, X_4 and X_5 become independent.
• The variables that are in the Markov blanket of X3 are: X1 , X2 , X4 , and X7 .
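The d-separation answers above can be checked by brute force. The sketch below instantiates the factorization in Equation (4) with random (assumed) conditional probability tables and tests each conditional independence directly on the resulting joint table; with generic parameters the dependence in the second case shows up as a failed test:

import itertools
import numpy as np

rng = np.random.default_rng(0)

def cpt(n_parents):
    # Random table of P(X = 1 | parents) for a binary variable.
    return rng.uniform(0.05, 0.95, size=(2,) * n_parents)

def b(p, v):
    # Bernoulli probability of value v given P(X = 1) = p.
    return p if v == 1 else 1.0 - p

p1, p7 = cpt(0), cpt(0)
p3, p2, p5, p6 = cpt(1), cpt(1), cpt(1), cpt(1)
p4 = cpt(2)

# Joint distribution of Equation (4); axes are (x1, ..., x7).
joint = np.zeros((2,) * 7)
for x in itertools.product(range(2), repeat=7):
    x1, x2, x3, x4, x5, x6, x7 = x
    joint[x] = (b(p1, x1) * b(p3[x1], x3) * b(p2[x3], x2) * b(p5[x2], x5)
                * b(p6[x2], x6) * b(p7, x7) * b(p4[x3, x7], x4))

def cond_indep(a, b_, c):
    # Brute-force test of X_a ⊥ X_b | X_c (0-based axes into the joint table).
    keep = sorted({a, b_, c})
    marg = joint.sum(axis=tuple(i for i in range(7) if i not in keep))
    p_abc = np.moveaxis(marg, [keep.index(i) for i in (a, b_, c)], (0, 1, 2))
    p_c = p_abc.sum(axis=(0, 1))
    lhs = p_abc / p_c                                            # P(X_a, X_b | X_c)
    rhs = ((p_abc.sum(axis=1) / p_c)[:, None, :]
           * (p_abc.sum(axis=0) / p_c)[None, :, :])              # P(X_a | X_c) P(X_b | X_c)
    return np.allclose(lhs, rhs)

print("X1 indep. X5 given X2:", cond_indep(0, 4, 1))   # True
print("X2 indep. X7 given X4:", cond_indep(1, 6, 3))   # False
print("X4 indep. X5 given X3:", cond_indep(3, 4, 2))   # True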

1.3 Hidden Markov Model
From the definition of the hidden Markov model (HMM) we have that:

P(x_1, \ldots, x_i, z_1, \ldots, z_i) = P(x_i \mid z_i) P(z_i \mid z_{i-1}) P(x_1, \ldots, x_{i-1}, z_1, \ldots, z_{i-1}).    (5)

We can now marginalize out the variables z1 , . . . , zi−1 to obtain the following:
P(x_1, \ldots, x_i, z_i) = \sum_{z_1, \ldots, z_{i-1}} P(x_1, \ldots, x_i, z_1, \ldots, z_i)
                         = \sum_{z_1, \ldots, z_{i-1}} P(x_i \mid z_i) P(z_i \mid z_{i-1}) P(x_1, \ldots, x_{i-1}, z_1, \ldots, z_{i-1})
                         = P(x_i \mid z_i) \sum_{z_1, \ldots, z_{i-1}} P(z_i \mid z_{i-1}) P(x_1, \ldots, x_{i-1}, z_1, \ldots, z_{i-1})    (6)
                         = P(x_i \mid z_i) \sum_{z_{i-1}} P(z_i \mid z_{i-1}) \sum_{z_1, \ldots, z_{i-2}} P(x_1, \ldots, x_{i-1}, z_1, \ldots, z_{i-1})
                         = P(x_i \mid z_i) \sum_{z_{i-1}} P(z_i \mid z_{i-1}) P(x_1, \ldots, x_{i-1}, z_{i-1}).

This completes the proof.
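The recursion above is the forward algorithm. A minimal sketch for a discrete HMM (assuming a randomly chosen transition matrix A, emission matrix B, and initial distribution pi) that checks the recursion against brute-force summation over all hidden paths:

import itertools
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 4                              # number of hidden states, sequence length
pi = rng.dirichlet(np.ones(K))           # P(z_1)
A = rng.dirichlet(np.ones(K), size=K)    # A[z_prev, z] = P(z | z_prev)
B = rng.dirichlet(np.ones(2), size=K)    # B[z, x]      = P(x | z), binary emissions
x = rng.integers(0, 2, size=T)           # an arbitrary observed sequence

# Forward recursion from Equation (6): alpha_t(z) = P(x_1, ..., x_t, z_t = z).
alpha = pi * B[:, x[0]]
for t in range(1, T):
    alpha = B[:, x[t]] * (A.T @ alpha)   # P(x_t | z_t) * sum_{z_{t-1}} P(z_t | z_{t-1}) alpha_{t-1}(z_{t-1})

# Brute-force check: sum the full joint over all hidden paths ending in each state.
brute = np.zeros(K)
for z in itertools.product(range(K), repeat=T):
    p = pi[z[0]] * B[z[0], x[0]]
    for t in range(1, T):
        p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
    brute[z[-1]] += p

assert np.allclose(alpha, brute)
print("P(x_1, ..., x_T, z_T = z) for each z:", alpha)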

2 Implementing a Fully Observed Directed Graphical Model (Mrinmaya)
Solution courtesy of Emmanouil Antonios Platanios (Anthony)
1. Baseline Graphical Model: I first implemented a baseline graphical model in which all variables are
assumed to be independent. That model only needs 12 parameters and its accuracy, as measured by the
ℓ1-distance between the model and the true joint probability distribution, is equal to 0.6960. Furthermore,
the log-likelihood value for the data is equal to −17,177,497.9001.
2. Our Graphical Model: I initially designed the following graphical model, which feels intuitive for
the variables involved. The following list includes each node of the graph and its parents:
• IsSummer
• HasFlu ← IsSummer
• HasFoodPoisoning
• HasHayFever ← IsSummer
• HasPneumonia ← IsSummer
• HasRespiratoryProblems ← HasFlu, HasHayFever, HasPneumonia
• HasGastricProblems ← HasFoodPoisoning
• HasRash ← HasFlu, HasHayFever
• Coughs ← HasFlu, HasHayFever, HasPneumonia
• IsFatigued ← HasFlu, HasHayFever, HasPneumonia
• Vomits ← HasFlu, HasFoodPoisoning, HasGastricProblems
• HasFever ← HasFlu, HasPneumonia

This model is quite simple to understand. The reasoning behind it is that the parents of each variable
are the variables describing events that might cause the event described by the child variable. In our
model, for example, symptoms are caused by illnesses, and therefore the variables corresponding to
symptoms are children of the variables corresponding to the illnesses that can cause those symptoms.
The implementation can return the probability of any possible assignment.
More details on how to use the code are provided in the “readme.txt” file submitted with the code.
3. Compactness: The total number of parameters of this model is 1 + 2 + 1 + 2 + 2 + 2^3 + 2 + 2^2 +
2^3 + 2^3 + 2^3 + 2^2 = 50 (i.e., counting 1 parameter for variables with no parents – the probability of
the variable being equal to true – and 2^n parameters for each variable with n parents – the conditional
probabilities table).
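This count can be reproduced mechanically from the parent lists; the snippet below simply mirrors the structure listed above:

# 1 parameter for each parentless binary variable (2^0), and 2^n parameters for a
# variable with n parents (one Bernoulli parameter per configuration of the parents).
parents = {
    "IsSummer": [], "HasFlu": ["IsSummer"], "HasFoodPoisoning": [],
    "HasHayFever": ["IsSummer"], "HasPneumonia": ["IsSummer"],
    "HasRespiratoryProblems": ["HasFlu", "HasHayFever", "HasPneumonia"],
    "HasGastricProblems": ["HasFoodPoisoning"],
    "HasRash": ["HasFlu", "HasHayFever"],
    "Coughs": ["HasFlu", "HasHayFever", "HasPneumonia"],
    "IsFatigued": ["HasFlu", "HasHayFever", "HasPneumonia"],
    "Vomits": ["HasFlu", "HasFoodPoisoning", "HasGastricProblems"],
    "HasFever": ["HasFlu", "HasPneumonia"],
}
print(sum(2 ** len(p) for p in parents.values()))  # prints 50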
4. Accuracy: The ℓ1-distance between my model and the true joint probability distribution is equal to 0.3320
in this case, which is significantly smaller than that of the baseline model, as expected.
5. Data Likelihood: The log-likelihood of the provided data set, after having fit our model to the data,
is equal to −15,163,339.2713.
6. Querying: The code for querying has been submitted along with the rest of the code. The main
idea behind my code is that I consider the joint probability distribution table (effectively a tensor)
over all possible assignments, “filtered” by the observed variables and re-normalized, and I sum over
all dimensions of the table corresponding to variables other than the query variables. The remaining
table (i.e., tensor) is the resulting distribution for the query variables. The outputs for the example
queries provided in the problems handout and for many different models that I tried using can be
seen by running the submitted code by following the instructions provided in the “readme.txt” file.
The outputs also include the true probability distributions corresponding to these queries, computed
in the same way from the true probability distribution provided (as opposed to the joint probability
distribution table under our model). Please note that the querying code, as well as the rest of the code,
is highly inefficient – it is not optimized, as optimizing the code is outside the scope of this assignment.
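To make the querying idea concrete, here is a minimal numpy sketch of the same strategy (slice the joint table on the evidence, re-normalize, sum out everything that is not queried); it is only an illustrative reimplementation with assumed variable names, not the submitted code:

import numpy as np

def query(joint, names, query_vars, evidence):
    # P(query_vars | evidence) from a full joint table with one binary axis per variable.
    table = joint
    # "Filter" by the observed variables: slice out the observed value on each evidence
    # axis, removing axes from the right first so earlier axis indices stay valid.
    for name in sorted(evidence, key=names.index, reverse=True):
        table = np.take(table, evidence[name], axis=names.index(name))
    names = [n for n in names if n not in evidence]
    table = table / table.sum()                    # re-normalize
    # Sum over every remaining dimension that does not correspond to a query variable.
    drop = tuple(i for i, n in enumerate(names) if n not in query_vars)
    return table.sum(axis=drop)

# Example with a hypothetical 3-variable joint table.
names = ["HasFlu", "Coughs", "HasFever"]
joint = np.random.default_rng(0).random((2, 2, 2))
joint /= joint.sum()
print(query(joint, names, ["HasFlu"], {"Coughs": 1}))  # P(HasFlu | Coughs = 1)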
7. Improved Graphical Model: I tried several improvements over the graphical model shown above.
All models I went through are included in the submitted code, in files named model#.py, where #
corresponds to the attempt number. On my sixth attempt I arrived at the following model, which includes
some refinements over the previous model:
• IsSummer
• HasFlu ← IsSummer
• HasFoodPoisoning
• HasHayFever
• HasPneumonia ← IsSummer
• HasRespiratoryProblems ← HasFlu, HasHayFever, HasPneumonia, HasFoodPoisoning
• HasGastricProblems ← HasFlu, HasFoodPoisoning
• HasRash ← HasFoodPoisoning, HasHayFever
• Coughs ← HasFlu, HasPneumonia, HasRespiratoryProblems
• IsFatigued ← HasFlu, HasHayFever, HasPneumonia
• Vomits ← HasFoodPoisoning, HasGastricProblems
• HasFever ← HasFlu, HasPneumonia
This model uses 55 parameters and its accuracy, as measured by the ℓ1-distance between the model and the
true joint probability distribution, is equal to 0.2667, which is significantly better than the previous
accuracy we obtained. Furthermore, the log-likelihood value for the data is equal to −14,899,891.2238.

3 Undirected Graphical Models (Mrinmaya)
Solution courtesy of Emmanouil Antonios Platanios (Anthony)

3.1 Markov Properties


We have the following mathematical definitions for the three concepts, written in terms of the sets of conditional
independence properties that they assert:
1. Global Markov Property: {X_A ⊥ X_B | X_C : for all disjoint subsets A, B, C of V such that sep_G(X_A; X_B | X_C)},
where sep_G(X_A; X_B | X_C) means that every path from a node in A to a node in B passes through C (i.e., X_C
separates X_A and X_B in graph G).
2. Local Markov Property: {X_i ⊥ X_{V \ ({i} ∪ N_G(i))} | X_{N_G(i)} : i ∈ V}, where N_G(i) denotes the indices
of the neighbors of node X_i in graph G.
3. Pairwise Markov Property: {X_i ⊥ X_j | X_{V \ {i,j}} : {i, j} ∉ E}.
Now, it is easy to see that sep_G(X_i; X_{V \ ({i} ∪ N_G(i))} | X_{N_G(i)}) holds for every i ∈ V. That is because,
starting from X_i, any path to a node other than X_i and its neighbors must go through at least one of the
neighbors of X_i (this follows from the definition of a neighbor in a graph). Therefore, by the global Markov
property we get that {X_i ⊥ X_{V \ ({i} ∪ N_G(i))} | X_{N_G(i)} : i ∈ V}, which is exactly the local Markov
property. So, we have shown that “Global Markov Property” ⇒ “Local Markov Property”.

It is also easy to see that if X_i and X_j are not neighbors (i.e., {i, j} ∉ E), then V \ {i, j} contains the indices of
all the neighbors of X_i (among other nodes) and j ∈ V \ ({i} ∪ N_G(i)). Therefore, starting from the local Markov
property X_i ⊥ X_{V \ ({i} ∪ N_G(i))} | X_{N_G(i)} and using the weak union property of conditional independence
(X ⊥ Y, W | Z ⇒ X ⊥ Y | Z, W) to move the remaining non-neighbors into the conditioning set, we obtain
X_i ⊥ X_j | X_{V \ {i,j}} for every {i, j} ∉ E. This is exactly the pairwise Markov property, and so we have shown
that “Local Markov Property” ⇒ “Pairwise Markov Property”.
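The key graph-theoretic fact used in the first implication (the neighbors of a node separate it from all non-neighbors) is easy to check computationally; a small sketch, assuming the networkx package and an arbitrary example graph:

import networkx as nx

# For every node i, removing its neighbors should disconnect it from every non-neighbor,
# i.e., N_G(i) separates X_i from X_{V \ ({i} ∪ N_G(i))}. The example graph is arbitrary.
G = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1), (2, 5), (5, 6)])

for i in G.nodes:
    non_neighbors = set(G.nodes) - {i} - set(G.neighbors(i))
    H = G.copy()
    H.remove_nodes_from(list(G.neighbors(i)))       # "condition on" the neighbors of i
    reachable = nx.node_connected_component(H, i)   # nodes still reachable from i
    assert reachable.isdisjoint(non_neighbors), f"node {i} can still reach a non-neighbor"

print("For every node, its neighbors separate it from all of its non-neighbors.")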

3.2 Gaussian Graphical Model


1. We have that:

P(X \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (X - \mu)^\top \Omega (X - \mu) \right\}
                      = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} X^\top \Omega X + X^\top \Omega \mu - \frac{1}{2} \mu^\top \Omega \mu \right\}    (7)
                      \propto \prod_{(i,j) \in E} \exp\left\{ -\frac{1}{2} x_i \Omega_{ij} x_j \right\} \prod_{i \in V} \exp\left\{ x_i \sum_{j \in V} \Omega_{ij} \mu_j \right\},

where we used the fact that this is a complete graph (i.e., there exists an edge between every possible
pair of nodes), as mentioned in the question. By matching the form that we obtained with the provided
form, we see that:

\psi_{i,j}(x_i, x_j) = \exp\left\{ -\frac{1}{2} x_i \Omega_{ij} x_j \right\} \quad \text{and} \quad \psi_i(x_i) = \exp\left\{ x_i \sum_{j \in V} \Omega_{ij} \mu_j \right\}.    (8)

2. Since we take the product over all edges and the product over all nodes, we can incorporate the product
over all nodes into the product over all edges, while being careful about “double-counting”. Note that if a
node has n neighbors (i.e., n edges involving that node), then if we incorporate the term ψ_i(x_i)
corresponding to that node into the edge product, we need to take its nth root, because it would appear
in n terms of that product (i.e., one term for each edge in which it appears). Therefore, we can see that:

P(X \mid \mu, \Sigma) \propto \prod_{(i,j) \in E} \psi_{i,j}(x_i, x_j) \, \psi_i(x_i)^{1/n(i)} \, \psi_j(x_j)^{1/n(j)},    (9)

which is the form that is provided to us in the problem sheet. Now, we see that if (i, j) ∉ E, then in
order for the value of this density to remain unchanged we need to have Ω_ij = 0 (i.e., so that all
the terms of the product corresponding to that edge are equal to 1). It is not difficult to see then
that, given X_{V \ {i,j}} and Ω_ij = 0, there is no “coupling” between the terms involving X_i and X_j, and
the conditional probability density function factorizes with respect to these two variables, implying
that X_i ⊥ X_j | X_{V \ {i,j}}. Furthermore, following the same reasoning, if X_i ⊥ X_j | X_{V \ {i,j}}, then
there must be no coupling between the terms involving X_i and X_j in the above equations. This means
that (i, j) must not be in E. Thus, we have argued (but not proven formally, as this was not required
by the problem statement) that (i, j) ∉ E ⇔ X_i ⊥ X_j | X_{V \ {i,j}}.
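This can also be seen numerically: for a Gaussian, the conditional covariance of (X_i, X_j) given all the remaining variables equals the inverse of the corresponding 2 × 2 block of Ω. A sketch with an arbitrarily chosen sparse precision matrix:

import numpy as np

# A positive definite precision matrix Omega with Omega[0, 1] = 0, i.e. no edge (0, 1).
d = 5
Omega = 2.0 * np.eye(d)
Omega[1, 2] = Omega[2, 1] = 0.6
Omega[2, 3] = Omega[3, 2] = 0.5
Omega[0, 3] = Omega[3, 0] = 0.4
Sigma = np.linalg.inv(Omega)

i, j = 0, 1
rest = [k for k in range(d) if k not in (i, j)]

# Conditional covariance of (X_i, X_j) given X_rest via the Schur-complement formula.
S_ab = Sigma[np.ix_([i, j], [i, j])]
S_ar = Sigma[np.ix_([i, j], rest)]
S_rr = Sigma[np.ix_(rest, rest)]
cond_cov = S_ab - S_ar @ np.linalg.inv(S_rr) @ S_ar.T

# It equals the inverse of the (i, j) block of Omega, so its off-diagonal entry is 0.
print(np.allclose(cond_cov, np.linalg.inv(Omega[np.ix_([i, j], [i, j])])))       # True
print("conditional covariance of X_0 and X_1 given the rest:", cond_cov[0, 1])  # ~0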

3.3 Ising Model


We have that:

p(X_s = 1 \mid \{X_1, \ldots, X_n\} \setminus X_s; \theta) = \frac{p(X_1, \ldots, X_s = 1, \ldots, X_n; \theta)}{p(X_1, \ldots, X_s = 0, \ldots, X_n; \theta) + p(X_1, \ldots, X_s = 1, \ldots, X_n; \theta)}
    = \frac{\exp\left( z_s + \theta_s + \sum_{t \in V \text{ s.t. } (s,t) \in E} \theta_{s,t} x_t \right)}{\exp(z_s) + \exp\left( z_s + \theta_s + \sum_{t \in V \text{ s.t. } (s,t) \in E} \theta_{s,t} x_t \right)}    (10)
    = \frac{\exp\left( \theta_s + \sum_{t \in V \text{ s.t. } (s,t) \in E} \theta_{s,t} x_t \right)}{1 + \exp\left( \theta_s + \sum_{t \in V \text{ s.t. } (s,t) \in E} \theta_{s,t} x_t \right)},

where

z_s = \sum_{v \in V \setminus s} \theta_v x_v + \sum_{(v,t) \in E : \, v, t \neq s} \theta_{v,t} x_v x_t.    (11)
It is easy to see that this is a logistic regression model over the neighbors of Xs .
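A small numerical check of this logistic form, assuming an arbitrary three-node model with made-up parameters θ:

import numpy as np

# Three binary spins x in {0, 1}^3 with node parameters theta_s and edge parameters theta_{s,t}.
theta = np.array([0.3, -0.5, 0.8])
theta_edge = {(0, 1): 1.2, (1, 2): -0.7}

def unnorm(x):
    # Unnormalized probability exp(sum_s theta_s x_s + sum_{(s,t) in E} theta_{s,t} x_s x_t).
    return np.exp(theta @ np.array(x) + sum(w * x[s] * x[t] for (s, t), w in theta_edge.items()))

s = 1                       # the spin whose conditional we compute
x_rest = (1, 0)             # fixed values for (X_0, X_2)
x_with_0 = (x_rest[0], 0, x_rest[1])
x_with_1 = (x_rest[0], 1, x_rest[1])
brute = unnorm(x_with_1) / (unnorm(x_with_0) + unnorm(x_with_1))

# Equation (10): sigmoid of theta_s plus the weighted sum over the neighbors of X_s.
logit = theta[s] + theta_edge[(0, 1)] * x_rest[0] + theta_edge[(1, 2)] * x_rest[1]
print(brute, 1.0 / (1.0 + np.exp(-logit)))   # the two values agree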

3.4 Boltzmann Machines


We have that:

P(x, y) = \frac{1}{\sum_{x,y} \exp\left( \sum_k \theta_k \phi_k(x, y) \right)} \exp\left( \sum_k \theta_k \phi_k(x, y) \right)    (12)

\Rightarrow P(x) = \frac{1}{\sum_{x,y} \exp\left( \sum_k \theta_k \phi_k(x, y) \right)} \sum_y \exp\left( \sum_k \theta_k \phi_k(x, y) \right)    (13)

\Rightarrow \log P(x) = \log \sum_y \exp\left( \sum_k \theta_k \phi_k(x, y) \right) - \log \sum_{x,y} \exp\left( \sum_k \theta_k \phi_k(x, y) \right)    (14)

\Rightarrow \frac{\partial \log P(x)}{\partial \theta_l} = \sum_y \frac{\exp\left( \sum_k \theta_k \phi_k(x, y) \right)}{\sum_y \exp\left( \sum_k \theta_k \phi_k(x, y) \right)} \phi_l(x, y) - \sum_{x,y} \frac{\exp\left( \sum_k \theta_k \phi_k(x, y) \right)}{\sum_{x,y} \exp\left( \sum_k \theta_k \phi_k(x, y) \right)} \phi_l(x, y)
    = \sum_y \frac{P(x, y)}{P(x)} \phi_l(x, y) - \sum_{x,y} P(x, y) \phi_l(x, y)    (15)
    = \sum_y P(y \mid x) \phi_l(x, y) - \sum_{x,y} P(x, y) \phi_l(x, y).
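A small sketch that checks Equation (15) against a finite-difference gradient, assuming a tiny model with one binary x, one binary y, and three arbitrary features φ_k:

import itertools
import numpy as np

def phi(x, y):
    return np.array([x, y, x * y], dtype=float)          # three assumed features

configs = list(itertools.product([0, 1], repeat=2))
theta = np.array([0.2, -0.4, 0.9])
x_obs = 1

def log_p_x(theta):
    # log P(x_obs) by full enumeration, as in Equation (14).
    scores = {(x, y): theta @ phi(x, y) for x, y in configs}
    log_Z = np.log(sum(np.exp(s) for s in scores.values()))
    return np.log(sum(np.exp(scores[(x_obs, y)]) for y in [0, 1])) - log_Z

# Analytic gradient (15): E_{P(y|x)}[phi(x, y)] - E_{P(x, y)}[phi(x, y)].
p_xy = np.array([np.exp(theta @ phi(x, y)) for x, y in configs])
p_xy /= p_xy.sum()
p_y_given_x = np.array([np.exp(theta @ phi(x_obs, y)) for y in [0, 1]])
p_y_given_x /= p_y_given_x.sum()
grad = (sum(p_y_given_x[y] * phi(x_obs, y) for y in [0, 1])
        - sum(p * phi(x, y) for p, (x, y) in zip(p_xy, configs)))

# Central finite differences for comparison.
eps = 1e-6
num = np.array([(log_p_x(theta + eps * e) - log_p_x(theta - eps * e)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(grad, num, atol=1e-6))                 # True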

4 Generalized Linear Models (Xun)
4.1 Exponential Family
1. The distribution is invariant under simultaneous linear rescaling of \eta and T(x). For instance, we can define
\tilde{\eta} = c\eta and \tilde{T}(x) = T(x)/c for any nonzero constant c.
2. Moment generating function of T(X):

\psi(t) = E\left[ e^{t T(X)} \right] = \int e^{t T(x)} h(x) \exp\{\eta T(x) - A(\eta)\} \, dx    (16)
        = \int h(x) \exp\{(t + \eta) T(x) - A(\eta)\} \, dx    (17)
        = \int h(x) \exp\{(t + \eta) T(x) - A(\eta) + A(t + \eta) - A(t + \eta)\} \, dx    (18)
        = \exp\{A(t + \eta) - A(\eta)\} \int h(x) \exp\{(t + \eta) T(x) - A(t + \eta)\} \, dx    (19)
        = \exp\{A(t + \eta) - A(\eta)\}.    (20)

Cumulant generating function:

g(t) = \log \psi(t) = A(t + \eta) - A(\eta).    (21)

The first cumulant is the mean:

E(T(X)) = g'(0) = A'(\eta).    (22)

The second cumulant is the variance:

Var(T(X)) = g''(0) = A''(\eta).    (23)
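As a quick sanity check of these two identities, the Poisson family has T(x) = x and A(η) = e^η, so both the mean and the variance should equal e^η; the sketch below compares finite-difference estimates of A'(η) and A''(η) with the empirical moments of Poisson samples (η = 0.7 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
eta = 0.7
A = np.exp                      # log-partition function of the Poisson family: A(eta) = e^eta

eps = 1e-4
A1 = (A(eta + eps) - A(eta - eps)) / (2 * eps)              # finite-difference A'(eta)
A2 = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2    # finite-difference A''(eta)

samples = rng.poisson(np.exp(eta), size=2_000_000)          # T(x) = x for the Poisson family
print(A1, samples.mean())   # both are close to e^0.7 ≈ 2.0138
print(A2, samples.var())    # both are close to e^0.7 as well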

3. Recall the inner product between two matrices, \langle A, B \rangle = \mathrm{tr}(A^\top B) = \sum_{i,j} A_{ij} B_{ij}.

p(x \mid \mu, \Sigma) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}    (24)
    \propto \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) - \frac{1}{2} \log \det \Sigma \right\}    (25)
    \propto \exp\left\{ -\frac{1}{2} x^\top \Sigma^{-1} x + x^\top \Sigma^{-1} \mu - \frac{1}{2} \mu^\top \Sigma^{-1} \mu - \frac{1}{2} \log \det \Sigma \right\}    (26)
    \propto \exp\left\{ \mathrm{tr}\left( -\frac{1}{2} \Sigma^{-1} x x^\top \right) + (\Sigma^{-1} \mu)^\top x - \frac{1}{2} \mu^\top \Sigma^{-1} \mu - \frac{1}{2} \log \det \Sigma \right\}    (27)
    \propto \exp\left\{ \left\langle -\frac{1}{2} \Sigma^{-1}, x x^\top \right\rangle + \left\langle \Sigma^{-1} \mu, x \right\rangle - \frac{1}{2} \mu^\top \Sigma^{-1} \mu - \frac{1}{2} \log \det \Sigma \right\}    (28)
    \propto \exp\{\langle \eta, T(x) \rangle - A(\eta)\},    (29)

where

\eta = \begin{pmatrix} \Sigma^{-1} \mu \\ -\frac{1}{2} \Sigma^{-1} \end{pmatrix}, \quad T(x) = \begin{pmatrix} x \\ x x^\top \end{pmatrix}, \quad A(\eta) = \frac{1}{2} \mu^\top \Sigma^{-1} \mu + \frac{1}{2} \log \det \Sigma.    (30)

Inverse parameter mapping:

\mu = -\frac{1}{2} \eta_2^{-1} \eta_1, \quad \Sigma = -\frac{1}{2} \eta_2^{-1}.    (31)

Now rewrite A(\eta) as a function of \eta:

A(\eta) = \frac{1}{2} \left( -\frac{1}{2} \eta_2^{-1} \eta_1 \right)^\top (-2\eta_2) \left( -\frac{1}{2} \eta_2^{-1} \eta_1 \right) + \frac{1}{2} \log |\Sigma|    (32)
        = -\frac{1}{4} \eta_1^\top \eta_2^{-1} \eta_1 - \frac{1}{2} \log |-2\eta_2|.    (33)
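A brief numerical confirmation of the inverse mapping (31) and of the rewritten log-partition function (33) against the original form (30), for an arbitrarily chosen (μ, Σ):

import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
L = rng.normal(size=(d, d))
Sigma = L @ L.T + d * np.eye(d)                   # an arbitrary positive definite covariance

Sigma_inv = np.linalg.inv(Sigma)
eta1 = Sigma_inv @ mu                             # natural parameters from Equation (30)
eta2 = -0.5 * Sigma_inv

# Inverse mapping, Equation (31).
print(np.allclose(mu, -0.5 * np.linalg.inv(eta2) @ eta1))    # True
print(np.allclose(Sigma, -0.5 * np.linalg.inv(eta2)))        # True

# A(eta) written as in (30) versus the rewritten form (33).
A_30 = 0.5 * mu @ Sigma_inv @ mu + 0.5 * np.log(np.linalg.det(Sigma))
A_33 = -0.25 * eta1 @ np.linalg.inv(eta2) @ eta1 - 0.5 * np.log(np.linalg.det(-2 * eta2))
print(np.isclose(A_30, A_33))                                # True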

4. Let λ be the set of dual variables.

(a) Lagrangian:

L(f, \lambda) = -\int f(x) \log f(x) \, dx + \lambda_0 \left( \int f(x) \, dx - 1 \right) + \sum_k \lambda_k \left( \int g_k(x) f(x) \, dx - \mu_k \right)    (34)
              = \int \left( -f(x) \log f(x) + \lambda_0 f(x) + \sum_k \lambda_k g_k(x) f(x) \right) dx - \lambda_0 - \sum_k \lambda_k \mu_k.    (35)

Let \tilde{L} be the integrand, which does not contain derivatives of f(x). By the optimality condition,

\frac{\delta L}{\delta f(x)} = \frac{\partial \tilde{L}}{\partial f(x)} = -\log f(x) - 1 + \lambda_0 + \sum_k \lambda_k g_k(x) = 0    (36)
\Rightarrow f^*(x) = \exp\left\{ \sum_k \lambda_k^* g_k(x) + \lambda_0^* - 1 \right\},    (37)

where \lambda^* is chosen such that f^* is feasible. This is clearly in exponential family form.
(b) Lagrangian:

L(f, \lambda) = \int \left( -f(x) \log f(x) + \lambda_0 f(x) + \lambda_1 x f(x) + \lambda_2 (x - \mu)^2 f(x) \right) dx + \text{const}.    (38)

Let \tilde{L} be the integrand, which does not contain derivatives of f(x). By the optimality condition,

\frac{\delta L}{\delta f(x)} = \frac{\partial \tilde{L}}{\partial f(x)} = -\log f(x) - 1 + \lambda_0 + \lambda_1 x + \lambda_2 (x - \mu)^2 = 0    (39)
\Rightarrow f^*(x) = \exp\left\{ \lambda_1^* x + \lambda_2^* (x - \mu)^2 + \lambda_0^* - 1 \right\}.    (40)

Now check the feasibility. It is easy to see that in order to make f^*(x) integrate to 1, we must have
\lambda_1^* = 0 and \lambda_2^* < 0 (otherwise the integral will be unbounded). Therefore

f^*(x) = \exp\left\{ \lambda_2^* (x - \mu)^2 + \lambda_0^* - 1 \right\}.    (41)

Now integrate using the Gaussian integral:

1 = \int f^*(x) \, dx = e^{\lambda_0^* - 1} \int e^{\lambda_2^* (x - \mu)^2} \, dx = e^{\lambda_0^* - 1} \sqrt{-\frac{\pi}{\lambda_2^*}}.    (42)

Therefore e^{\lambda_0^* - 1} = \sqrt{-\frac{\lambda_2^*}{\pi}}. Then

f^*(x) = \sqrt{-\frac{\lambda_2^*}{\pi}} \, e^{\lambda_2^* (x - \mu)^2}.    (43)

Verify the mean:

\int x f^*(x) \, dx = \sqrt{-\frac{\lambda_2^*}{\pi}} \int x e^{\lambda_2^* (x - \mu)^2} \, dx = \sqrt{-\frac{\lambda_2^*}{\pi}} \, \mu \sqrt{-\frac{\pi}{\lambda_2^*}} = \mu.    (44)

Verify the variance:

\sigma^2 = \int (x - \mu)^2 f^*(x) \, dx = \sqrt{-\frac{\lambda_2^*}{\pi}} \int t^2 e^{\lambda_2^* t^2} \, dt = \sqrt{-\frac{\lambda_2^*}{\pi}} \, \frac{1}{2} \sqrt{\frac{\pi}{(-\lambda_2^*)^3}} = -\frac{1}{2\lambda_2^*},    (45)

where t = x - \mu. Thus \lambda_2^* = -\frac{1}{2\sigma^2}.

Plugging back into f^*(x):

f^*(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}},    (46)

which is the Gaussian distribution with mean \mu and variance \sigma^2.
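A quick numerical confirmation that this stationary solution, with λ_2^* = −1/(2σ^2), is normalized, satisfies both moment constraints, and coincides with the Gaussian density in (46) (the values of μ and σ are arbitrary, and scipy is assumed to be available):

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 1.3, 0.8
lam2 = -1.0 / (2 * sigma**2)                                         # lambda_2^*
f_star = lambda x: np.sqrt(-lam2 / np.pi) * np.exp(lam2 * (x - mu) ** 2)

print(quad(f_star, -np.inf, np.inf)[0])                              # ≈ 1 (normalization)
print(quad(lambda x: x * f_star(x), -np.inf, np.inf)[0])             # ≈ mu (mean constraint)
print(quad(lambda x: (x - mu) ** 2 * f_star(x), -np.inf, np.inf)[0]) # ≈ sigma^2 (variance constraint)
print(np.isclose(f_star(2.0), norm.pdf(2.0, loc=mu, scale=sigma)))   # True: matches Equation (46)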

Additional notes on the calculus of variations: Consider the functional

F[y] = \int G[y(x), y'(x), x] \, dx,    (47)

where y'(x) = \frac{dy}{dx}. A necessary condition for F[y] to attain an extremum is given by the Euler–Lagrange
equation:

\frac{\delta F}{\delta y(x)} = \frac{\partial G}{\partial y} - \frac{d}{dx} \frac{\partial G}{\partial y'} = 0,    (48)

where the left-hand side is the functional derivative of F with respect to y at the point x. In both (35) and (38) the
integrand depends only on y(x) and x, thus (48) becomes

\frac{\delta F}{\delta y(x)} = \frac{\partial G}{\partial y} = 0.    (49)

4.2 Poisson Regression


1. Rewrite Poisson in exponential family form:

p(y \mid x) = \frac{\lambda^y e^{-\lambda}}{y!} = \frac{1}{y!} \exp(y \log \lambda - \lambda).    (50)

Thus

\eta = \log \lambda, \quad T(y) = y, \quad A(\eta) = \lambda = e^\eta, \quad \psi(\mu) = \log \mu.    (51)

Therefore the canonical response function is

f(\xi) = \psi^{-1}(\xi) = e^\xi.    (52)

2. The stochastic update rule has the form (Eq. (8.84) in Jordan's textbook):

\theta^{(t+1)} = \theta^{(t)} + \rho \left( y_i - \mu_i^{(t)} \right) x_i,    (53)

where \mu_i^{(t)} = f(\langle \theta^{(t)}, x_i \rangle). Plugging in f, we have

\theta^{(t+1)} = \theta^{(t)} + \rho \left( y_i - e^{\langle \theta^{(t)}, x_i \rangle} \right) x_i.    (54)
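For illustration, here is a sketch of update (54) on synthetic data drawn from an assumed true parameter vector; with a small constant step size the iterates end up near the generating parameter, though not exactly at the MLE:

import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 3
theta_true = np.array([0.5, -0.3, 0.2])           # assumed generating parameter
X = rng.normal(scale=0.5, size=(n, d))
y = rng.poisson(np.exp(X @ theta_true))           # Poisson responses with log link

theta = np.zeros(d)
rho = 0.01                                        # small constant step size
for epoch in range(20):
    for i in rng.permutation(n):
        mu_i = np.exp(theta @ X[i])               # canonical response f(<theta, x_i>) = e^<theta, x_i>
        theta += rho * (y[i] - mu_i) * X[i]       # Equation (54)

print(theta_true, theta)                          # the estimate should be close to theta_true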

3. IRLS update rule (Eq. (8.91) in Jordan's textbook):

\theta^{(t+1)} = \left( X^\top W^{(t)} X \right)^{-1} X^\top W^{(t)} \left( \eta^{(t)} + \left[ W^{(t)} \right]^{-1} \left( y - \mu^{(t)} \right) \right),    (55)

where W is a diagonal matrix with

W_{ii} = \frac{d\mu_i}{d\eta_i} = \frac{d^2 A(\eta_i)}{d\eta_i^2} = e^{\eta_i} = e^{\langle \theta, x_i \rangle}.    (56)

A more detailed derivation can be found in Section 8.2.1 of Jordan's textbook.
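A matching sketch of the IRLS update (55)-(56) on the same kind of synthetic Poisson-regression data (again an assumed example); a handful of iterations suffices for convergence to the maximum-likelihood estimate:

import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 3
theta_true = np.array([0.5, -0.3, 0.2])           # assumed generating parameter
X = rng.normal(scale=0.5, size=(n, d))
y = rng.poisson(np.exp(X @ theta_true))

theta = np.zeros(d)
for _ in range(25):
    eta = X @ theta                               # linear predictors eta_i = <theta, x_i>
    w = np.exp(eta)                               # W_ii from Equation (56); also the mean mu_i
    z = eta + (y - w) / w                         # working response eta + W^{-1}(y - mu)
    theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))   # Equation (55)

print(theta_true, theta)                          # IRLS converges to the Poisson-regression MLE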
