Input Convex Neural Networks: Supplementary Material

Brandon Amos, Lei Xu, J. Zico Kolter

A. Additional architectures

A.1. Convolutional architectures

Convolutions are important to many visual structured tasks. We have left convolutions out to keep the prior ICNN notation light by using matrix-vector operations. ICNNs can be similarly created with convolutions because the convolution is a linear operator.

The construction of convolutional layers in ICNNs depends on the type of input and output space. If the input and output space are similarly structured (e.g. both spatial), the jth feature map of a convolutional PICNN layer i can be defined by

    z_{i+1}^j = g_i( z_i ∗ W_{i,j}^{(z)} + (Sx) ∗ W_{i,j}^{(x)} + (Sy) ∗ W_{i,j}^{(y)} + b_{i,j} )        (20)

where the convolution kernels W are the same size, S scales the input and output to be the same size as the previous feature map, and where we omit some of the Hadamard product terms that can appear above for simplicity of presentation.

If the input space is spatial, but the output space has another structure (e.g. the simplex), the convolution over the output space can be replaced by a matrix-vector operation, such as

    z_{i+1}^j = g_i( z_i ∗ W_{i,j}^{(z)} + (Sx) ∗ W_{i,j}^{(x)} + B_{i,j} y + b_{i,j} )        (21)

where the product B_{i,j} y is a scalar.
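
As a concrete illustration of (20), the sketch below implements one convolutional PICNN layer in PyTorch. Everything beyond the equation itself is an assumption made here for illustration: S is taken to be bilinear resizing to the previous feature map's resolution, g_i is a ReLU, the Hadamard product terms are omitted as above, and the kernel shapes and the clamp used to keep W^{(z)} nonnegative are choices of this sketch rather than prescriptions of the text.

```python
import torch
import torch.nn.functional as F

def picnn_conv_layer(z_i, x, y, W_z, W_x, W_y, b):
    """One convolutional PICNN layer as in (20), minimal sketch.
    z_i: (N, C_z, H, W) previous feature map; x, y: spatial inputs;
    W_z, W_x, W_y: conv kernels with the same odd square spatial size; b: (C_out,) bias.
    Convexity in y requires W_z to stay elementwise nonnegative."""
    H, W = z_i.shape[-2:]
    pad = W_z.shape[-1] // 2                      # "same" padding for odd kernels
    # S: rescale x and y to the previous feature map's spatial size
    Sx = F.interpolate(x, size=(H, W), mode="bilinear", align_corners=False)
    Sy = F.interpolate(y, size=(H, W), mode="bilinear", align_corners=False)
    out = (F.conv2d(z_i, W_z.clamp(min=0), padding=pad)   # clamp keeps W^(z) >= 0
           + F.conv2d(Sx, W_x, padding=pad)
           + F.conv2d(Sy, W_y, padding=pad)
           + b.view(1, -1, 1, 1))
    return F.relu(out)                            # g_i
```
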
B. Exact inference in ICNNs

Although it is not a practical approach for solving the optimization tasks, we first highlight the fact that the inference problem for the networks presented above (where the nonlinearities are either ReLU or linear units) can be posed as a linear program. Specifically, considering the FICNN network in (2), inference can be written as the optimization problem

    minimize_{y, z_1, ..., z_k}  z_k
    subject to  z_{i+1} ≥ W_i^{(z)} z_i + W_i^{(y)} y + b_i,  i = 0, ..., k − 1        (22)
                z_i ≥ 0,  i = 1, ..., k − 1.

This problem exactly replicates the equations of the FICNN, with the exception that we have replaced the ReLU and the equality constraint between layers with a positivity constraint on the z_i terms and an inequality. However, because we are minimizing the final z_k term, and because each inequality constraint is convex, at the solution one of these constraints must be tight, i.e., (z_i)_j = (W_i^{(z)} z_i + W_i^{(y)} y + b_i)_j or (z_i)_j = 0, which recovers the ReLU non-linearity exactly. The exact same procedure can be used to create an exact inference procedure for the PICNN.

Although the LP formulation is appealing in its simplicity, in practice these optimization problems will have a number of variables equal to the total number of activations in the entire network. Furthermore, most LP solution methods for such problems require that we form and invert structured matrices with blocks such as W_i^T W_i, as is the case for most interior-point methods (Wright, 1997) or even approximate algorithms such as the alternating direction method of multipliers (Boyd et al., 2011); these matrices are large and dense or have structured forms such as non-cyclic convolutions that are expensive to invert. Even incremental approaches like the Simplex method require that we form inverses of subsets of columns of these matrices, which are additionally different for structured operations like convolutions, and which overall still involve substantially more computation than a single forward pass. Furthermore, such solvers typically do not exploit the substantial effort that has gone into accelerating the forward and backward computation passes for neural networks using hardware such as GPUs. Thus, as a whole, these do not present a viable option for optimizing the networks.
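
For a small FICNN, the LP (22) can be written down directly; the sketch below does so with cvxpy purely for illustration. The weight lists W_z, W_y, b, the scalar final layer, and the box bounds on y (added to keep the LP bounded) are assumptions of this sketch, and the first layer is taken to have no z term. As the discussion above notes, this is not a practical inference route for networks of realistic size.

```python
import cvxpy as cp

def ficnn_inference_lp(W_z, W_y, b, y_dim, y_bounds=(0.0, 1.0)):
    """Solve the exact-inference LP (22) for a small FICNN, minimal sketch.
    W_z[i], W_y[i], b[i] are the layer-i weights (W_z[0] is ignored, following
    the convention that the first layer has no z term); all W_z[i] are assumed
    elementwise nonnegative and the last layer is assumed to have output size 1."""
    k = len(W_y)                                   # number of layers; z_k is the output
    y = cp.Variable(y_dim)
    z = [cp.Variable(W_y[i].shape[0]) for i in range(k)]

    cons = [z[0] >= W_y[0] @ y + b[0]]                          # layer 0 (no z term)
    for i in range(1, k):
        cons.append(z[i] >= W_z[i] @ z[i - 1] + W_y[i] @ y + b[i])
    cons += [z[i] >= 0 for i in range(k - 1)]                   # z_i >= 0, i = 1..k-1
    cons += [y >= y_bounds[0], y <= y_bounds[1]]                # keep the LP bounded

    prob = cp.Problem(cp.Minimize(cp.sum(z[-1])), cons)
    prob.solve()
    return y.value, prob.value
```
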
Algorithm 1 A typical bundle method to optimize f : R^m × R^n → R over R^n for K iterations with a fixed x and initial starting point y^1.

    function BUNDLEMETHOD(f, x, y^1, K)
        G ← 0 ∈ R^{K×n}
        h ← 0 ∈ R^K
        for k = 1, K do
            G_k^T ← ∇_y f(x, y^k; θ)^T                            ▷ kth row of G
            h_k ← f(x, y^k; θ) − ∇_y f(x, y^k; θ)^T y^k
            y^{k+1}, t^{k+1} ← argmin_{y∈Y, t} {t | G_{1:k} y + h_{1:k} ≤ t1}
        end for
        return y^{K+1}
    end function

Algorithm 2 Our bundle entropy method to optimize f : R^m × [0, 1]^n → R over [0, 1]^n for K iterations with a fixed x and initial starting point y^1.

    function BUNDLEENTROPYMETHOD(f, x, y^1, K)
        G_ℓ ← [ ]
        h_ℓ ← [ ]
        for k = 1, K do
            APPEND(G_ℓ, ∇_y f(x, y^k; θ)^T)
            APPEND(h_ℓ, f(x, y^k; θ) − ∇_y f(x, y^k; θ)^T y^k)
            a_k ← LENGTH(G_ℓ)                                     ▷ The number of active constraints.
            G_k ← CONCAT(G_ℓ) ∈ R^{a_k×n}
            h_k ← CONCAT(h_ℓ) ∈ R^{a_k}
            if a_k = 1 then
                λ_k ← 1
            else
                λ_k ← PROJNEWTONLOGISTIC(G_k, h_k)
            end if
            y^{k+1} ← (1 + exp(G_k^T λ_k))^{−1}
            DELETE(G_ℓ[i] and h_ℓ[i] where λ_i ≤ 0)               ▷ Prune inactive constraints.
        end for
        return y^{K+1}
    end function

C. The bundle method for approximate inference in ICNNs

We here review the basic bundle method (Smola et al., 2008) that we build upon in our bundle entropy method. The bundle method takes advantage of the fact that for a convex objective, the first-order approximation at any point is a global underestimator of the function; this lets us maintain a piecewise linear lower bound on the function by adding cutting planes formed by this first-order approximation, and then repeatedly optimizing this lower bound. Specifically, the process follows the procedure shown in Algorithm 1. Denoting the iterates of the algorithm as y^k, at each iteration of the algorithm, we compute the first-order approximation to the function

    f(x, y^k; θ) + ∇_y f(x, y^k; θ)^T (y − y^k)        (23)

and update the next iteration by solving the optimization problem

    y^{k+1} := argmin_{y∈Y} max_{1≤i≤k} {f(x, y^i; θ) + ∇_y f(x, y^i; θ)^T (y − y^i)}.        (24)

A bit more concretely, the optimization problem can be written via a set of linear inequality constraints

    y^{k+1}, t^{k+1} := argmin_{y∈Y, t} {t | Gy + h ≤ t1}        (25)

where G ∈ R^{k×n} has rows equal to

    g_i^T = ∇_y f(x, y^i; θ)^T        (26)

and h ∈ R^k has entries equal to

    h_i = f(x, y^i; θ) − ∇_y f(x, y^i; θ)^T y^i.        (27)
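
For concreteness, a minimal Python sketch of Algorithm 1 follows, assuming Y is the box [lo, hi]^n and using scipy's LP solver for the inner problem (25); f and grad_f are caller-supplied callables with the parameters θ folded in.

```python
import numpy as np
from scipy.optimize import linprog

def bundle_method(f, grad_f, x, y0, K, lo=0.0, hi=1.0):
    """Minimal sketch of Algorithm 1; Y is assumed to be the box [lo, hi]^n."""
    n = y0.size
    y = y0.astype(float).copy()
    G_rows, h_vals = [], []
    for _ in range(K):
        g = grad_f(x, y)                       # g_k^T = grad_y f(x, y^k; theta), as in (26)
        G_rows.append(g)
        h_vals.append(f(x, y) - g @ y)         # h_k as in (27)
        G, h = np.array(G_rows), np.array(h_vals)
        # Inner LP (25) over (y, t): minimize t  s.t.  G y + h <= t 1,  y in [lo, hi]^n
        c = np.zeros(n + 1)
        c[-1] = 1.0
        A_ub = np.hstack([G, -np.ones((G.shape[0], 1))])
        res = linprog(c, A_ub=A_ub, b_ub=-h, bounds=[(lo, hi)] * n + [(None, None)])
        assert res.success, "inner LP failed"
        y = res.x[:n]
    return y
```
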

D. Bundle Entropy Algorithm


In Algorithm 2.

E. Deep Q-learning with ICNNs

In Algorithm 3.
Algorithm 3 Deep Q-learning with ICNNs. OPT-ALG is a convex minimization algorithm such as gradient descent or the bundle entropy method. Q̃_θ is the objective the optimization algorithm solves. In gradient descent, Q̃_θ(s, a) = Q(s, a|θ) and with the bundle entropy method, Q̃_θ(s, a) = Q(s, a|θ) + H(a).

    Select a discount factor γ ∈ (0, 1) and moving average factor τ ∈ (0, 1)
    Initialize the ICNN −Q(s, a|θ) with target network parameters θ′ ← θ and a replay buffer R ← ∅
    for each episode e = 1, E do
        Initialize a random process N for action exploration
        Receive initial observation state s_1
        for i = 1, I do
            a_i ← OPT-ALG(−Q_θ, s_i, a_{i,0}) + N_i               ▷ For some initial action a_{i,0}
            Execute a_i and observe r_{i+1} and s_{i+1}
            INSERT(R, (s_i, a_i, s_{i+1}, r_{i+1}))
            Sample a random minibatch from the replay buffer: R_M ⊆ R
            for (s_m, a_m, s_m^+, r_m^+) ∈ R_M do
                a_m^+ ← OPT-ALG(−Q_{θ′}, s_m^+, a_{m,0}^+)        ▷ Uses the target parameters θ′
                y_m ← r_m^+ + γ Q(s_m^+, a_m^+|θ′)
            end for
            Update θ with a gradient step to minimize L = (1/|R_M|) Σ_m (Q̃(s_m, a_m|θ) − y_m)^2
            θ′ ← τθ + (1 − τ)θ′                                   ▷ Update the target network.
        end for
    end for
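
A rough PyTorch sketch of the inner update of Algorithm 3 is given below, assuming the gradient-descent variant of OPT-ALG (so that Q̃ = Q), actions box-constrained to [−1, 1], and a caller-supplied ICNN q_net(s, a); the exploration noise, replay buffer, and outer episode loop are omitted.

```python
import torch

def opt_alg_gd(q_net, s, a0, steps=10, lr=0.1):
    """Gradient-descent OPT-ALG: approximately maximize Q(s, a|theta) over a,
    with actions assumed box-constrained to [-1, 1]."""
    a = a0.clone().requires_grad_(True)
    for _ in range(steps):
        obj = -q_net(s, a).sum()                         # minimize -Q over a
        grad, = torch.autograd.grad(obj, a)
        a = (a - lr * grad).clamp(-1.0, 1.0).detach().requires_grad_(True)
    return a.detach()

def q_learning_step(q_net, target_net, optimizer, batch, gamma=0.99, tau=0.005):
    """One minibatch update: target computation, MSE step, soft target update.
    q_net(s, a) and target_net(s, a) are assumed to return shape (N, 1)."""
    s, a, s_next, r = batch
    a_next = opt_alg_gd(target_net, s_next, torch.zeros_like(a))  # a_{m,0}^+ = 0 here
    with torch.no_grad():
        y = r + gamma * target_net(s_next, a_next).squeeze(-1)    # y_m
    loss = ((q_net(s, a).squeeze(-1) - y) ** 2).mean()            # L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                                         # theta' <- tau*theta + (1 - tau)*theta'
        for p, p_t in zip(q_net.parameters(), target_net.parameters()):
            p_t.mul_(1.0 - tau).add_(p, alpha=tau)
```
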
F. Max-margin structured prediction

In the more traditional structured prediction setting, where we do not aim to fit the energy function directly but fit the predictions made by the system to some target outputs, there are different possibilities for learning the ICNN parameters. One such method is based upon the max-margin structured prediction framework (Tsochantaridis et al., 2005; Taskar et al., 2005). Given some training example (x, y*), we would like to require that this example has a joint energy that is lower than all other possible values for y. That is, we want the function f̃ to satisfy the constraint

    f̃(x, y*; θ) ≤ min_y f̃(x, y; θ)        (28)

Unfortunately, these conditions can be trivially fit by choosing a constant f̃ (although the entropy term alleviates this problem slightly, we can still choose an approximately constant function), so instead the max-margin approach adds a margin-scaling term that requires this gap to be larger for y further from y*, as measured by some loss function ∆(y, y*). Additionally adding slack variables to allow for potential violation of these constraints, we arrive at the typical max-margin structured prediction optimization problem

    minimize_{θ, ξ≥0}  (λ/2) ‖θ‖_2^2 + Σ_{i=1}^m ξ_i
    subject to  f̃(x_i, y_i; θ) ≤ min_{y∈Y} ( f̃(x_i, y; θ) − ∆(y_i, y) ) − ξ_i        (29)

As a simple example, for multiclass classification tasks where y* denotes a "one-hot" encoding of examples, we can use a multi-variate entropy term and let ∆(y, y*) = y*^T (1 − y). Training requires solving this "loss-augmented" inference problem, which is convex for suitable choices of the margin scaling term.

The optimization problem (29) is naturally still not convex in θ, but can be solved via the subgradient method for structured prediction (Ratliff et al., 2007). This algorithm iteratively selects a training example x_i, y_i, then 1) solves the optimization problem

    y* = argmin_{y∈Y} f(x_i, y; θ) − ∆(y_i, y)        (30)

and 2) if the margin is violated, updates the network's parameters according to the subgradient

    θ := P_+ [θ − α (λθ + ∇_θ f(x_i, y_i; θ) − ∇_θ f(x_i, y*; θ))]        (31)

where P_+ denotes the projection of W_{1:k−1}^{(z)} onto the nonnegative orthant. This method can be easily adapted to use mini-batches instead of a single example per subgradient step, and also adapted to alternative optimization methods like AdaGrad (Duchi et al., 2011) or ADAM (Kingma & Ba, 2014). Further, a fast approximate solution to y* can be used instead of the exact solution.
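
A schematic Python sketch of steps (30) and (31) is below. It treats θ as a flat parameter vector, and f, grad_theta_f, delta, loss_aug_argmin, and project_Wz_nonneg are caller-supplied callables assumed for illustration (the last applies P_+ to the W^{(z)} blocks only).

```python
def max_margin_subgradient_step(theta, x_i, y_i, f, grad_theta_f, delta,
                                loss_aug_argmin, project_Wz_nonneg,
                                alpha=0.01, lam=1e-4):
    """One subgradient step of the max-margin training scheme, minimal sketch."""
    # 1) loss-augmented inference (30): y* = argmin_y f(x_i, y; theta) - delta(y_i, y)
    y_star = loss_aug_argmin(theta, x_i, y_i)

    # 2) update only if the margin is violated
    if f(theta, x_i, y_i) > f(theta, x_i, y_star) - delta(y_i, y_star):
        g = lam * theta + grad_theta_f(theta, x_i, y_i) - grad_theta_f(theta, x_i, y_star)
        theta = project_Wz_nonneg(theta - alpha * g)   # (31): P_+ on the W^(z) blocks
    return theta
```
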
G. Proof of Proposition 3

Proof (of Proposition 3). We have by the chain rule that

    ∂ℓ/∂θ = (∂ℓ/∂ŷ) ( ∂ŷ/∂G · ∂G/∂θ + ∂ŷ/∂h · ∂h/∂θ ).        (32)

The challenging terms to compute in this equation are the ∂ŷ/∂G and ∂ŷ/∂h terms. These can be computed (although we will ultimately not compute them explicitly, but just compute the product of these matrices and other terms in the Jacobian) by implicit differentiation of the KKT conditions. Specifically, the KKT conditions of the bundle entropy method (considering only the active constraints at the solution) are given by

    1 + log ŷ − log(1 − ŷ) + G^T λ = 0
    Gŷ + h − t1 = 0        (33)
    1^T λ = 1.

For simplicity of presentation, we consider first the Jacobian with respect to h. Taking differentials of these equations with respect to h gives

    diag(1/ŷ + 1/(1 − ŷ)) dy + G^T dλ = 0
    G dy + dh − dt 1 = 0        (34)
    1^T dλ = 0

or in matrix form

    [ diag(1/ŷ + 1/(1 − ŷ))   G^T    0 ] [ dy ]   [   0  ]
    [ G                        0    −1 ] [ dλ ] = [ −dh ]        (35)
    [ 0                      −1^T    0 ] [ dt ]   [   0  ]

To compute the Jacobian ∂ŷ/∂h we can solve the system above with the right hand side given by dh = I, and the resulting dy term will be the corresponding Jacobian. However, in our ultimate objective we always left-multiply the proper terms in the above equation by ∂ℓ/∂ŷ. Thus, we instead define

    [ c_y ]   [ diag(1/ŷ + 1/(1 − ŷ))   G^T    0 ]^{−1} [ −(∂ℓ/∂ŷ)^T ]
    [ c_λ ] = [ G                        0    −1 ]       [      0      ]        (36)
    [ c_t ]   [ 0                      −1^T    0 ]       [      0      ]

and we have the simple formula for the Jacobian product

    (∂ℓ/∂ŷ)(∂ŷ/∂h) = (c_λ)^T.        (37)

A similar set of operations taking differentials with respect to G leads to the matrix equations

    [ diag(1/ŷ + 1/(1 − ŷ))   G^T    0 ] [ dy ]   [ −dG^T λ ]
    [ G                        0    −1 ] [ dλ ] = [ −dG y   ]        (38)
    [ 0                      −1^T    0 ] [ dt ]   [    0    ]

and the corresponding Jacobian products / gradients are given by

    (∂ℓ/∂ŷ)(∂ŷ/∂G) = c_y λ^T + ŷ (c_λ)^T.        (39)

Finally, using the definitions that

    g_i^T = ∇_y f(x, y^i; θ)^T,   h_i = f(x, y^i; θ) − ∇_y f(x, y^i; θ)^T y^i        (40)
we recover the formula presented in the proposition.
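
For reference, a small numpy sketch of the Jacobian products (36)-(39) follows, using the shapes and layout of this section as written; it is an illustration rather than the authors' implementation.

```python
import numpy as np

def bundle_entropy_backward(G, y_hat, lam, dl_dy):
    """Jacobian-vector products (36)-(39), minimal sketch.
    G: (k, n) active-constraint matrix, y_hat: solution with entries in (0, 1),
    lam: (k,) dual variables, dl_dy: (n,) gradient of the loss w.r.t. y_hat."""
    k, n = G.shape
    D = np.diag(1.0 / y_hat + 1.0 / (1.0 - y_hat))
    # KKT coefficient matrix from (35)/(36)
    K = np.block([
        [D,                 G.T,               np.zeros((n, 1))],
        [G,                 np.zeros((k, k)), -np.ones((k, 1))],
        [np.zeros((1, n)), -np.ones((1, k)),   np.zeros((1, 1))],
    ])
    rhs = np.concatenate([-dl_dy, np.zeros(k + 1)])   # [-(dl/dy)^T; 0; 0]
    sol = np.linalg.solve(K, rhs)
    c_y, c_lam = sol[:n], sol[n:n + k]
    dl_dh = c_lam                                     # (37)
    # (39), laid out as written there (an n x k array, i.e. the transpose of G's shape)
    dl_dG = np.outer(c_y, lam) + np.outer(y_hat, c_lam)
    return dl_dh, dl_dG
```
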
H. State and action space sizes in the OpenAI gym MuJoCo benchmarks

    Environment                 # State   # Action
    InvertedPendulum-v1               4          1
    InvertedDoublePendulum-v1        11          1
    Reacher-v1                       11          2
    HalfCheetah-v1                   17          6
    Swimmer-v1                        8          2
    Hopper-v1                        11          3
    Walker2d-v1                      17          6
    Ant-v1                          111          8
    Humanoid-v1                     376         17
    HumanoidStandup-v1              376         17

    Table 4. State and action space sizes in the OpenAI gym MuJoCo benchmarks.

I. Synthetic classification examples

We begin with a simple example to illustrate the classification performance of a two-hidden-layer FICNN and PICNN on two-dimensional binary classification tasks from the scikit-learn toolkit (Pedregosa et al., 2011). Figure 4 shows the classification performance on the dataset. The FICNN's energy function, which is fully convex in X × Y jointly, is able to capture complex, but sometimes restrictive, decision boundaries. The PICNN, which is non-convex over X but convex over Y, overcomes these restrictions and can capture more complex decision boundaries.

J. Multi-Label Classification Training Plots

In Figure 5.

K. Image Completion

The losses are in Figure 6.

Figure 4. FICNN (top) and PICNN (bottom) classification of synthetic non-convex decision boundaries. Best viewed in color.

Figure 5. Training (blue) and test (red) macro-F1 score of a feedforward network (left) and PICNN (right) on the BibTeX multi-label classification dataset. The final test F1 scores are 0.396 and 0.415, respectively. (Higher is better.)

Figure 6. Mean Squared Error (MSE) on the train (blue, rolling over 1 epoch) and test (red) images from Olivetti faces for PICNNs trained with the bundle entropy method (left) and back optimization (center), and back optimization with the convexity constraint relaxed (right). The minimum test MSEs are 833.0, 872.0, and 850.9, respectively.
