Professional Documents
Culture Documents
ICNN Supplementary
ICNN Supplementary
ICNN Supplementary
Algorithm 1 A typical bundle method to optimize f : Algorithm 2 Our bundle entropy method to optimize f :
Rm×n → R over Rn for K iterations with a fixed x and Rm × [0, 1]n → R over [0, 1]n for K iterations with a fixed
initial starting point y 1 . x and initial starting point y 1 .
function B UNDLE M ETHOD(f , x, y 1 , K) function B UNDLE E NTROPY M ETHOD(f , x, y 1 , K)
G ← 0 ∈ RK×n G` ← [ ]
h ← 0 ∈ RK h` ← [ ]
for k = 1, K do for k = 1, K do
GTk ← ∇y f (x, y k ; θ)T . kth row of G A PPEND(G` , ∇y f (x, y k ; θ)T )
hk ← f (x, y k ; θ) − ∇y f (x, y k ; θ)T y k A PPEND(h` , f (x, y k ; θ) − ∇y f (x, y k ; θ)T y k )
y k+1 , tk+1 ← argminy∈Y,t {t | G1:k y+h1:k ≤ ak ← L ENGTH(G` ) . The number of active
t1} constraints.
end for Gk ← C ONCAT(G` )∈ Rak ×n
return y K+1 hk ← C ONCAT(h` )∈ Rak
end function if ak = 1 then
λk ← 1
else
C. The bundle method for approximate λk ← P ROJ N EWTON L OGISTIC(Gk , hk )
inference in ICNNs end if
y k+1 ← (1 + exp(GTk λk ))−1
We here review the basic bundle method (Smola et al.,
D ELETE(G` [i] and h` [i] where λi ≤ 0) . Prune
2008) that we build upon in our bundle entropy method.
inactive constraints.
The bundle method takes advantage of the fact that for a
end for
convex objective, the first-order approximation at any point
return y K+1
is a global underestimator of the function; this lets us main-
end function
tain a piecewise linear lower bound on the function by
adding cutting planes formed by this first order approxi-
mation, and then repeatedly optimizing this lower bound. E. Deep Q-learning with ICNNs
Specifically, the process follows the procedure shown in
Algorithm 1. Denoting the iterates of the algorithm as y k , In Algorithm 3.
at each iteration of the algorithm, we compute the first or-
der approximation to the function
Algorithm 3 Deep Q-learning with ICNNs. Opt-Alg function ∆(y, y ? ). Additionally adding slack variables to
is a convex minimization algorithm such as gradient de- allow for potential violation of these constraints, we arrive
scent or the bundle entropy method. Q̃θ is the objective at the typical max-margin structured prediction optimiza-
the optimization algorithm solves. In gradient descent, tion problem
Q̃θ (s, a) = Q(s, a|θ) and with the bundle entropy method,
Q̃θ (s, a) = Q(s, a|θ) + H(a). m
λ X
Select a discount factor γ ∈ (0, 1) and moving average minimize kθk22 + ξi
θ,ξ≥0 2 i=1
factor τ ∈ (0, 1)
Initialize the ICNN −Q(s, a|θ) with target network pa- subject to f˜(xi , yi ; θ) ≤ min f˜(xi , y; θ) − ∆(yi , y) − ξi
y∈Y
rameters θ0 ← θ and a replay buffer R ← ∅
(29)
for each episode e = 1, E do
Initialize a random process N for action exploration
As a simple example, for multiclass classification tasks
Receive initial observation state s1
where y ? denotes a “one-hot” encoding of examples, we
for i = 1, I do
can use a multi-variate entropy term and let ∆(y, y ? ) =
ai ← O PT-A LG(−Qθ , si , ai,0 )+Ni . For some
y ? T (1 − y). Training requires solving this “loss-
initial action ai,0
augmented” inference problem, which is convex for suit-
Execute ai and observe ri+1 and si+1
able choices of the margin scaling term.
I NSERT(R, (si , ai , si+1 , ri+1 ))
Sample a random minibatch from the replay The optimization problem (29) is naturally still not con-
buffer: RM ⊆ R vex in θ, but can be solved via the subgradient method for
for (sm , am , s+ +
m , rm ) ∈ RM do structured prediction (Ratliff et al., 2007). This algorithm
+
am ← O PT-A LG(−Qθ0 ,s+
+
m ,am,0 ) . Uses the iteratively selects a training example xi , yi , then 1) solves
0
target parameters θ the optimization problem
+ + 0
ym ← rm + γQ(s+ m , am |θ )
end for y ? = argmin f (xi , y; θ) − ∆(yi , y) (30)
Update θ with a gradient step to minimize L = y∈Y
1
P 2
|RM | m Q̃(sm , am |θ) − ym
and 2) if the margin is violated, updates the network’s pa-
θ0 ← τ θ + (1 − τ )θ0 . Update the target
rameters according to the subgradient
network.
end for
end for θ := P+ [θ − α (λθ + ∇θ f (xi , yi , θ) − ∇θ f (xi , y ? ; θ))]
(31)
(z)
where P+ denotes the projection of W1:k−1 onto the non-
F. Max-margin structured prediction negative orthant. This method can be easily adapted to use
mini-batches instead of a single example per subgradient
In the more traditional structured prediction setting, where step, and also adapted to alternative optimization methods
we do not aim to fit the energy function directly but fit like AdaGrad (Duchi et al., 2011) or ADAM (Kingma &
the predictions made by the system to some target out- Ba, 2014). Further, a fast approximate solution to y ? can
puts, there are different possibilities for learning the ICNN be used instead of the exact solution.
parameters. One such method is based upon the max-
margin structured prediction framework (Tsochantaridis
G. Proof of Proposition 3
et al., 2005; Taskar et al., 2005). Given some training ex-
ample (x, y ? ), we would like to require that this example Proof (of Proposition 3). We have by the chain rule that
has a joint energy that is lower than all other possible val-
ues for y. That is, we want the function f˜ to satisfy the
∂` ∂` ∂ ŷ ∂G ∂ ŷ ∂h
constraint = + . (32)
∂θ ∂ ŷ ∂G ∂θ ∂h ∂θ
f˜(x, y ? ; θ) ≤ min f˜(x, y; θ) (28)
y
The challenging terms to compute in this equation are the
∂ ŷ ∂ ŷ
Unfortunately, these conditions can be trivially fit by ∂G and ∂h terms. These can be computed (although we
choosing a constant f˜ (although the entropy term allevi- will ultimately not compute them explicitly, but just com-
ates this problem slightly, we can still choose an approx- pute the product of these matrices and other terms in the
imately constant function), so instead the max-margin ap- Jacobian), by implicit differentiation of the KKT condi-
proach adds a margin-scaling term that requires this gap to tions. Specifically, the KKT conditions of the bundle en-
be larger for y further from y ? , as measured by some loss tropy method (considering only the active constraints at the
Input Convex Neural Networks: Supplementary Material
solution) are given by H. State and action space sizes in the OpenAI
gym MuJoCo benchmarks.
1 + log ŷ − log(1 − ŷ) + GT λ = 0
Gŷ + h − t1 = 0 (33) Environment # State # Action
T
1 λ = 1. InvertedPendulum-v1 4 1
InvertedDoublePendulum-v1 11 1
For simplicity of presentation, we consider first the Jaco- Reacher-v1 11 2
bian with respect to h. Taking differentials of these equa- HalfCheetah-v1 17 6
tions with respect to h gives Swimmer-v1 8 2
Hopper-v1 11 3
1 1
Walker2d-v1 17 6
diag + dy + GT dλ = 0 Ant-v1 111 8
ŷ 1 − ŷ
(34) Humanoid-v1 376 17
Gdy + dh − dt1 = 0 HumanoidStandup-v1 376 17
1T dλ = 0
Table 4. State and action space sizes in the OpenAI gym MuJoCo
or in matrix form benchmarks.
1
diag ŷ1 + 1−ŷ GT
0 dy 0 I. Synthetic classification examples
−1 dλ = −dh .
G 0
0 0 −1T dt 0 We begin with a simple example to illustrate the classi-
(35) fication performance of a two-hidden-layer FICNN and
To compute the Jacobian ∂h∂ ŷ
we can solve the system above PICNN on two-dimensional binary classification tasks
with the right hand side given by dh = I, and the resulting from the scikit-learn toolkit (Pedregosa et al., 2011). Fig-
dy term will be the corresponding Jacobian. However, in ure 4 shows the classification performance on the dataset.
our ultimate objective we always left-multiply the proper The FICNN’s energy function which is fully convex in
terms in the above equation by ∂∂`ŷ . Thus, we instead define X × Y jointly is able to capture complex, but sometimes
restrictive decision boundaries. The PICNN, which is non-
−1 convex over X but convex over Y overcomes these restric-
cy
diag ŷ1 + 1
1−ŷ
GT 0 −( ∂∂`ŷ )T
cλ =
G 0
−1 0 tions and can capture more complex decision boundaries.
ct 0 −1T 0 0
(36) J. Multi-Label Classification Training Plots
and we have the the simple formula for the Jacobian prod-
uct In Figure 5.
∂` ∂ ŷ
= (cλ )T . (37)
∂ ŷ ∂h K. Image Completion
A similar set of operations taking differentials with respect The losses are in Figure 6.
to G leads to the matrix equations
GT
1 1
diag ŷ
+ 1−ŷ
0 dy −dGT λ
−1 dλ = −dGy
G 0
0 −1T 0 dt 0
(38)
and the corresponding Jacobian products / gradients are
given by
∂` ∂ ŷ
= cy λT + ŷ(cλ )T . (39)
∂ ŷ ∂G
Finally, using the definitions that
Figure 4. FICNN (top) and PICNN (bottom) classification of synthetic non-convex decision boundaries. Best viewed in color.
Figure 5. Training (blue) and test (red) macro-F1 score of a feedforward network (left) and PICNN (right) on the BibTeX multi-label
classification dataset. The final test F1 scores are 0.396 and 0.415, respectively. (Higher is better.)
Figure 6. Mean Squared Error (MSE) on the train (blue, rolling over 1 epoch) and test (red) images from Olivetti faces for PICNNs
trained with the bundle entropy method (left) and back optimization (center), and back optimization with the convexity constraint
relaxed (right). The minimum test MSEs are 833.0, 872.0, and 850.9, respectively.