Lecture 5 - Adversarial Networks and Variants


Background Reading: GANs, f-GAN, conditional mutual information, Lipschitz functions, primal-dual optimization.

Based on the basic principles of adversarial learning, multiple variants have been proposed in the literature. We shall study a few of them, namely InfoGAN, BiGAN, CycleGAN, StyleGAN and WGAN.


InfoGAN

Objective: To learn a GAN with a latent space that is semantically disentangled.

Proposal: The input noise vector to the generator is decomposed into two parts:
z - incompressible noise
c - structured latent code

c is denoted by L latent variables c_1, c_2, ..., c_L, with a factored distribution given by

P(c_1, c_2, ..., c_L) = \prod_{i=1}^{L} P(c_i)

Since the generator network takes both z and c as inputs, it is denoted by G(z, c). In a standard GAN, the latent code c may be ignored and there is no way to enforce that the generated distribution should utilize both z and c. Thus, InfoGAN proposes an information-theoretic regularization over the standard GAN objective as follows:

L_{InfoGAN} = \min_\theta \max_w F(\theta, w) - \lambda I(c; G_\theta(z, c))

Here I(c; G_\theta) is the mutual information between the latent codes and the generator distribution.
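As an illustration, here is a minimal PyTorch sketch (not from the lecture; the dimensions, the single 10-way categorical code, and the tiny generator are assumptions) of sampling the incompressible noise z and a factored code c and feeding their concatenation to the generator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed sizes (not from the lecture): 62-dim noise z, one 10-way categorical code c.
Z_DIM, C_DIM, X_DIM, BATCH = 62, 10, 784, 32

# Placeholder generator G(z, c): any network taking the concatenated (z, c) vector.
generator = nn.Sequential(nn.Linear(Z_DIM + C_DIM, 128), nn.ReLU(), nn.Linear(128, X_DIM))

def sample_latents(batch=BATCH):
    z = torch.randn(batch, Z_DIM)             # incompressible noise z
    idx = torch.randint(0, C_DIM, (batch,))   # each code c_i sampled independently (factored prior)
    c = F.one_hot(idx, C_DIM).float()         # structured latent code c
    return z, c, idx

z, c, idx = sample_latents()
x_fake = generator(torch.cat([z, c], dim=1))  # G(z, c)
```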


Variational Mutual Information Maximization

In practice, the mutual information term I(c; G_\theta) cannot be directly optimized, as it requires access to the latent posterior p(c|x). Thus, a variational lower bound is optimized instead, as follows. Let q(c|x) be the variational approximation to p(c|x).

I(c; G) = H(c) - H(c | G)
        = E_{x \sim G(z,c)} [ E_{c' \sim p(c|x)} [ \log p(c'|x) ] ] + H(c)
        = E_{x \sim G(z,c)} [ E_{c' \sim p(c|x)} [ \log q(c'|x) ] + D_{KL}( p(\cdot|x) \,\|\, q(\cdot|x) ) ] + H(c)
        \ge E_{x \sim G(z,c)} [ E_{c' \sim p(c|x)} [ \log q(c'|x) ] ] + H(c)        (since D_{KL} \ge 0)
The above term, however, needs samples from p(c|x) to compute the inner expectation, which is avoided using the following trick (Lemma 5.1 in the InfoGAN paper):

E_{x \sim G(z,c)} [ E_{c' \sim p(c|x)} [ \log q(c'|x) ] ] = E_{c \sim p(c), x \sim G(z,c)} [ \log q(c|x) ]        (1)

The expectation in the RHS of Eq. 1 can be computed using Monte Carlo estimates.

In practice, the distribution q(c|x) is approximated using another neural network, in addition to the generator and discriminator networks, and L_{InfoGAN} is optimized.

Post training, it is shown that variation in a single component of c corresponds to variation in a single semantic factor in the generated data space.
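A minimal sketch of this auxiliary network, reusing the assumed names (X_DIM, C_DIM, generator, x_fake, idx) from the previous sketch: Q(c|x) is a small classifier head, and the Monte Carlo estimate of the lower bound (up to the constant H(c)) is the negative cross-entropy between the sampled code indices and Q's prediction on the generated batch.

```python
import torch.nn as nn

# Auxiliary network Q(c|x), approximating the posterior over the code given a generated sample.
q_network = nn.Sequential(nn.Linear(X_DIM, 128), nn.ReLU(), nn.Linear(128, C_DIM))

# Monte Carlo estimate of E_{c ~ p(c), x ~ G(z,c)}[log q(c|x)]:
# for a categorical code this is the negative cross-entropy on the generated batch.
logits = q_network(x_fake)
mi_lower_bound = -nn.functional.cross_entropy(logits, idx)

lambda_mi = 1.0   # regularization weight lambda (assumed value)
# generator loss = standard GAN generator loss - lambda_mi * mi_lower_bound
```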
Bidirectional GANs (BiGAN)

Objective: Standard GANs do not have a means to learn the inverse mapping from the data space to the latent space. The objective of a BiGAN is to learn both the mappings, from the latent space to the data space and vice versa, simultaneously.

Proposal: In addition to the standard generator, an encoder network E: X → Z is trained. Let p_E(z|x) denote the density induced by the encoder network. The standard discriminator is also modified to predict p(Y | x, z), where Y = 1 if x \sim p_X and Y = 0 if x \sim G(z). With this, the BiGAN optimizes the following objective:

L_{BiGAN} = \min_{\theta, \phi} \max_w F(\theta, \phi, w)

F(\theta, \phi, w) = E_{x \sim p_X} [ E_{z \sim p_E(\cdot|x)} [ \log D_w(x, z) ] ] + E_{z \sim p_Z} [ E_{x \sim p_G(\cdot|z)} [ \log (1 - D_w(x, z)) ] ]
It is shown that the optimal point is reached when P_{EX} = P_{GZ}, where

P_{EX}(x, z) = p_X(x) \, p_E(z|x),        P_{GZ}(x, z) = p_Z(z) \, p_G(x|z)

The BiGAN thus optimizes the JS divergence between the joint distributions over the data and the latent spaces.
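A minimal PyTorch sketch of this setup, with assumed toy dimensions and networks: the discriminator scores joint pairs, receiving (x, E(x)) for real data and (G(z), z) for generated data.

```python
import torch
import torch.nn as nn

X_DIM, Z_DIM = 784, 64                      # assumed toy dimensions

G = nn.Sequential(nn.Linear(Z_DIM, 256), nn.ReLU(), nn.Linear(256, X_DIM))      # generator G: Z -> X
E = nn.Sequential(nn.Linear(X_DIM, 256), nn.ReLU(), nn.Linear(256, Z_DIM))      # encoder   E: X -> Z
D = nn.Sequential(nn.Linear(X_DIM + Z_DIM, 256), nn.ReLU(), nn.Linear(256, 1))  # joint discriminator D(x, z)

x_real = torch.randn(32, X_DIM)             # stand-in for a data batch
z_fake = torch.randn(32, Z_DIM)             # z ~ p_Z

real_pair = torch.cat([x_real, E(x_real)], dim=1)   # (x, E(x)),  x ~ p_X
fake_pair = torch.cat([G(z_fake), z_fake], dim=1)   # (G(z), z),  z ~ p_Z

bce = nn.BCEWithLogitsLoss()
d_loss = bce(D(real_pair), torch.ones(32, 1)) + bce(D(fake_pair), torch.zeros(32, 1))
# G and E are trained to fool D, i.e. with the labels swapped.
```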
CycleGAN

Objective: To learn to translate samples between the distributions of a pair of domains.

Proposal: Use adversarial learning in a conditional setting, wherein a sample from the source distribution is used as input to the GAN instead of the noise variable. In addition, incorporate a two-way cycle-consistency loss that enforces transitivity.
Given a pair of domains X and Y with p_X and p_Y as densities, CycleGAN has two mapping functions

G_X: X → Y,        G_Y: Y → X

which are learned simultaneously, along with the corresponding discriminator functions D_Y and D_X respectively. In addition to the usual GAN loss, a cycle-consistency loss is introduced for enforcing transitivity.

L_{cyc} = E_{x \sim p_X} [ \| G_Y(G_X(x)) - x \|_1 ] + E_{y \sim p_Y} [ \| G_X(G_Y(y)) - y \|_1 ]

Therefore, the final objective function is as follows:

L_{CycleGAN} = E_{x \sim p_X} [ \log (1 - D_Y(G_X(x))) ] + E_{y \sim p_Y} [ \log D_Y(y) ] + E_{y \sim p_Y} [ \log (1 - D_X(G_Y(y))) ] + E_{x \sim p_X} [ \log D_X(x) ] + L_{cyc}
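A minimal sketch of the cycle-consistency term with assumed placeholder mapping networks on toy tensors (the adversarial terms via D_X and D_Y are omitted):

```python
import torch
import torch.nn as nn

DIM = 128                                   # assumed feature dimension for both domains

G_X = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))   # G_X: X -> Y
G_Y = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))   # G_Y: Y -> X

x = torch.randn(16, DIM)                    # batch from domain X
y = torch.randn(16, DIM)                    # batch from domain Y

# L_cyc = E_x ||G_Y(G_X(x)) - x||_1 + E_y ||G_X(G_Y(y)) - y||_1
l1 = nn.L1Loss()
cycle_loss = l1(G_Y(G_X(x)), x) + l1(G_X(G_Y(y)), y)
# total generator loss = adversarial terms (via D_X, D_Y) + cycle_loss
```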
StyleGAN

Objective: To learn unsupervised attribute separation or disentanglement in the generated space of a GAN.

Proposal: Learn multiple latent vectors corresponding to different possible styles in the feature space of the generator.
Given a latent code z, first a latent transformation is learned via a mapping network f: Z → W, giving w ∈ W. Subsequently, the w vector is fed separately to each of the feature maps in the generator via an adaptive instance normalization (AdaIN) layer as follows:

AdaIN(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}

where x_i is the i-th feature map of the generator, with \mu(x_i) and \sigma(x_i) being the corresponding statistics. y = (y_s, y_b) is the output of an affine transformation layer with w as the input.

Note that the StyleGAN modification is in the generator architecture and does not involve any loss/metric modification.
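A minimal sketch of an AdaIN layer in this spirit (the dimensions and the exact placement inside the generator are assumptions): the affine layer maps w to per-channel (y_s, y_b), which rescale and shift the instance-normalized feature map.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: scale/shift normalized feature maps by a style vector."""
    def __init__(self, w_dim, channels):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * channels)   # maps w to (y_s, y_b)

    def forward(self, x, w):                           # x: (N, C, H, W), w: (N, w_dim)
        y_s, y_b = self.affine(w).chunk(2, dim=1)      # per-channel scale and bias
        mu = x.mean(dim=(2, 3), keepdim=True)          # per-channel statistics mu(x_i)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8 # sigma(x_i)
        x_norm = (x - mu) / sigma
        return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]

# Assumed sizes: 512-dim w, 64-channel feature map.
ada = AdaIN(w_dim=512, channels=64)
out = ada(torch.randn(4, 64, 16, 16), torch.randn(4, 512))
```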
Wasserstein GAN

GANs in the naive formulation are known to be very unstable to train. This is ascribed to the non-alignment of the manifolds that form the supports of the distributions between which the divergence is calculated. It is shown that, for a pair of distributions whose supports do not have full dimension and do not perfectly align, the usual f-divergences such as JSD and forward/reverse KLD will be maxed out, with the existence of a perfect discriminator. This calls for a 'softer' divergence metric between distributions, one that does not max out when the manifolds of the supports do not perfectly align.

Earth Mover's (EM) or Wasserstein distance:

Let P and Q denote two distributions over a space X. The Wasserstein distance between P and Q is defined as

W(P, Q) = \inf_{\gamma \in \Pi(P, Q)} E_{(x, y) \sim \gamma} [ \| x - y \| ]

Here, the infimum is over the set \Pi(P, Q) of all joint distributions \gamma whose marginals are respectively P and Q.

\gamma(x, y) is the amount of "mass" that is transported from x to y in order to transform P to Q, which, when multiplied with \| x - y \|, specifies the amount of 'work' done in the said transportation. Thus, the EMD or WD is the cost incurred by the optimal transport plan. Note that every \gamma can be thought of as a transport plan, out of which the optimal one is sought in the EMD via the infimum over \Pi.
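As a small numerical illustration (assuming SciPy is available; this example is not from the notes), the 1-D Wasserstein distance between two discrete distributions can be computed directly: moving half the mass from 0 to 1 costs 0.5, which matches the optimal transport plan.

```python
from scipy.stats import wasserstein_distance

# P puts all its mass at 0; Q splits it equally between 0 and 1.
# The optimal plan moves mass 0.5 a distance of 1, so W(P, Q) = 0.5.
w = wasserstein_distance(u_values=[0.0], v_values=[0.0, 1.0],
                         u_weights=[1.0], v_weights=[0.5, 0.5])
print(w)  # 0.5
```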

The EMD is shown to possess some 'good' properties. For instance, suppose p_r is a density over X and z is a random variable over Z. Let g_\theta: Z \times R^d → X be a parametric function (a neural network) with an induced distribution p_\theta. Then it can be shown that if g_\theta is continuous in \theta, so is W(p_r, p_\theta), unlike JSD(p_r, p_\theta) and KL(p_r, p_\theta).
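A standard example illustrating this gap (the "parallel lines" example from the WGAN paper; the values below are a sketch of that argument, not derived in these notes):

```latex
% Let z ~ U[0,1] and g_\theta(z) = (\theta, z) \in \mathbb{R}^2, so that P_0 is uniform on the
% vertical segment \{0\} \times [0,1] and P_\theta is uniform on \{\theta\} \times [0,1].
\begin{align*}
W(P_0, P_\theta)   &= |\theta|, \\
JSD(P_0, P_\theta) &= \log 2 \quad \text{for } \theta \neq 0 \ (\text{and } 0 \text{ at } \theta = 0), \\
KL(P_\theta \,\|\, P_0) &= +\infty \quad \text{for } \theta \neq 0 .
\end{align*}
% Only the Wasserstein distance varies continuously with \theta,
% and hence provides a usable training signal for the generator.
```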

Therefore, it is desirable to use the Wasserstein distance, rather than any of the other divergence metrics, to learn generative samplers. However, the infimum in the definition of the WD is intractable, albeit a dual definition may be used to optimize the WD in practice.
WGAN:

The Kantorovich-Rubinstein duality provides a dual definition for the WD as follows:

W(p_X, p_\theta) = \sup_{\| f \|_L \le 1} E_{x \sim p_X} [ f(x) ] - E_{x \sim p_\theta} [ f(x) ]

The supremum is over all 1-Lipschitz functions f: X → R. Typically, the function f is approximated by a neural network called the critic network, and the supremum is replaced by the maximum over the parameters of the neural network.

The distribution p_\theta is approximated using a sampler or generator neural network g_\theta(z), z \sim N(0, I). With these, the objective for a WGAN to optimize will be as follows:
L_{WGAN} = \min_\theta \; \max_{w : \| f_w \|_L \le 1} \; E_{x \sim p_X} [ f_w(x) ] - E_{z \sim p_Z} [ f_w(g_\theta(z)) ]

In practice, f_w is a neural network which is made Lipschitz by weight clipping after every gradient update, or by weight regularization.
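A minimal sketch of the WGAN critic update with weight clipping, using assumed toy networks, RMSprop, and the clipping threshold 0.01 used in the WGAN paper:

```python
import torch
import torch.nn as nn

X_DIM, Z_DIM, CLIP = 784, 64, 0.01          # assumed toy dimensions and clipping threshold

critic = nn.Sequential(nn.Linear(X_DIM, 256), nn.ReLU(), nn.Linear(256, 1))   # f_w
g = nn.Sequential(nn.Linear(Z_DIM, 256), nn.ReLU(), nn.Linear(256, X_DIM))    # g_theta
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

x_real = torch.randn(32, X_DIM)             # stand-in for a data batch
z = torch.randn(32, Z_DIM)                  # z ~ p_Z

# Critic ascends E_x[f_w(x)] - E_z[f_w(g_theta(z))], so we minimize its negative.
critic_loss = -(critic(x_real).mean() - critic(g(z).detach()).mean())
opt_c.zero_grad()
critic_loss.backward()
opt_c.step()

# Enforce the Lipschitz constraint crudely by clipping weights after the update.
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-CLIP, CLIP)

# The generator step (with its own optimizer) would then minimize -E_z[f_w(g_theta(z))].
```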

References

1. https://arxiv.org/abs/1606.03657

2. https://arxiv.org/abs/1605.09782

3. https://arxiv.org/abs/1703.10593

4. https://arxiv.org/abs/1701.07875

5. https://arxiv.org/abs/1812.04948

6. https://arxiv.org/abs/1701.04862

7. https://vincentherrmann.github.io/blog/wasserstein
