Improving GAN With Neighbors Embedding and Gradient Matching
discriminator often converges much faster than the generator. Some regularization techniques are weight-clipping (Arjovsky and Bottou 2017), gradient penalty constraints (Gulrajani et al. 2017; Roth et al. 2017; Kodali et al. 2017; Petzka, Fischer, and Lukovnicov 2017; Liu 2018), consensus constraint (Mescheder, Nowozin, and Geiger 2017; Mescheder, Geiger, and Nowozin 2018), or spectral norm (Miyato et al. 2018). However, over-constraining the discriminator may cause cycling issues (Nagarajan and Kolter 2017; Mescheder, Geiger, and Nowozin 2018).

Issues of GAN can also be tackled via optimizer regularization: changing the optimization process (Metz et al. 2017), using two time-scale update rules for better convergence (Heusel et al. 2017), or averaging network parameters (Yazıcı et al. 2018).

Regularizing the generator is another direction: i) it can be achieved by modifying the generator objective function with feature matching (Salimans et al. 2016) or discriminator-score distance (Tran, Bui, and Cheung 2018); ii) or by using Auto-Encoders (AE) or latent codes to regularize the generator. AAE (Makhzani et al. 2016) uses an AE to constrain the generator; the goal is to match the encoded latent distribution to some given prior distribution via a minimax game. The problem of AAE is that pixel-wise reconstruction with the l2-norm causes blurry results, and the minimax game on the latent samples has the same problems (e.g., mode collapse) as on the data samples, because the AE alone is not powerful enough to overcome these issues. VAE/GAN (Larsen et al. 2015) combined VAE and GAN into one single model and used a feature-wise distance for the reconstruction to avoid the blur; the generator is regularized by the VAE model to reduce mode collapse. Nevertheless, VAE/GAN has similar limitations to VAE (Kingma and Welling 2013), including the re-parameterization trick for back-propagation and the requirement of access to an exact functional form of the prior distribution. ALI (Dumoulin et al. 2016) and BiGAN (Donahue, Krähenbühl, and Darrell 2016) jointly train the data/latent samples in the GAN framework; this method can learn the AE model implicitly after training. MDGAN (Che et al. 2016) requires two discriminators for two separate steps: manifold and diffusion. The manifold step manages to learn a good AE; the diffusion step is similar to the original GAN, except that the reconstructed samples are used as real samples instead. InfoGAN (Chen et al. 2016) learns disentangled representations by maximizing the mutual information of induced latent codes. MMGAN (Park et al. 2018) makes the strong assumption that the manifolds of real and fake samples are spheres: first, it aligns real and fake sample statistics by matching the two manifold spheres (centre and radius), and then it applies a correlation matrix to reduce mode collapse. Dist-GAN (Tran, Bui, and Cheung 2018) constrains the generator by a regularized auto-encoder; furthermore, the authors use the reconstructed samples to regularize the convergence of the discriminator.

Auto-encoders can also be used in the discriminator objective. EBGAN (Zhao, Mathieu, and LeCun 2017) introduces an energy-based model in which the discriminator is considered as an energy function minimized via reconstruction errors. BEGAN (Berthelot, Schumm, and Metz 2017) extends EBGAN by optimizing the Wasserstein distance between AE loss distributions.

Proposed method

Our proposed system with gradient matching (GM) and neighbors embedding (NE) constraints, namely GN-GAN, consists of three main components: the auto-encoder, the discriminator, and the generator. In our model, we first train the auto-encoder, then the discriminator, and finally the generator, as presented in Algorithm 1 (a minimal code sketch of this loop is given at the end of this subsection).

Algorithm 1 Our GN-GAN model
1: Initialize discriminator, encoder and generator D, E, G respectively. N_iter is the number of iterations.
2: repeat
3:   x ← Random mini-batch of m data points from the dataset.
4:   z ← Random n samples from the noise distribution P_z.
5:   // Train the auto-encoder using x and z by Eqn. 1
6:   E, G ← min V_AE(E, G)
7:   // Train the discriminator according to Eqn. 7 on x, z
8:   D ← max V_D(D, G)
9:   // Train the generator on x, z according to Eqn. 13
10:  G ← min V_G(D, G)
11: until N_iter iterations are reached
12: return D, E, G

Neighbors embedding constraint for Auto-encoder

We use an auto-encoder (AE) in our model for two reasons: i) to prevent the generator from collapsing severely, and ii) to regularize the generator into producing samples that resemble real ones. However, using an AE alone is not adequate to avoid mode collapse, especially for high-dimensional data. Therefore, we propose the additional regularization in Eq. 1:

V_AE(E, G) = ||x − G(E(x))||^2 + λ_r V_R(E, G)   (1)

Eq. 1 is the objective of our regularized AE. The first term is the reconstruction error of a conventional AE. The second term V_R(E, G) is our proposed neighbors embedding constraint, to be discussed. Here, G is the GAN generator (the decoder of the AE), E is the encoder, and λ_r is a constant.

Mode collapse is a failure case of GAN where the generator often generates similar samples: the diversity of generated samples is small compared with that of the original dataset. As discussed in previous work (e.g. (Tran, Bui, and Cheung 2018)), with mode collapse the generator maps two far-apart latent samples to nearby data points in the high-dimensional data space with high probability. This observation motivates our idea to constrain the distances between generated data points in order to alleviate mode collapse: the data point distances and the corresponding latent sample distances should be consistent.

The motivation of our neighbors embedding constraint V_R(E, G) is to constrain the relative distances among data points and their corresponding latent points within the data and latent manifolds respectively (Fig. 1). In our model, we apply the probabilistic relative distance (PRDist) of t-SNE
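As a concrete illustration of Algorithm 1, the following is a minimal PyTorch-style sketch of the three alternating updates. It is our paraphrase rather than the authors' released code: since Eqs. 7 and 13 are not reproduced in this excerpt, standard GAN losses stand in for V_D and V_G, and the neighbors embedding term V_R of Eq. 1 is omitted.

```python
# Minimal sketch of Algorithm 1 (our paraphrase, not the authors' code).
# Standard GAN losses stand in for V_D (Eq. 7) and V_G (Eq. 13); the
# neighbors embedding term V_R of Eq. 1 is omitted here.
import torch

def train_gn_gan(E, G, D, loader, noise_dim, n_iter):
    opt_ae = torch.optim.Adam(list(E.parameters()) + list(G.parameters()))
    opt_d = torch.optim.Adam(D.parameters())
    opt_g = torch.optim.Adam(G.parameters())
    eps, it = 1e-8, iter(loader)
    for _ in range(n_iter):
        x = next(it)                            # mini-batch of real samples
        z = torch.randn(x.size(0), noise_dim)   # noise samples from P_z

        # 1) Auto-encoder step (Eq. 1; reconstruction term only in this sketch).
        loss_ae = (x - G(E(x))).pow(2).mean()
        opt_ae.zero_grad(); loss_ae.backward(); opt_ae.step()

        # 2) Discriminator step: ascend V_D, i.e. descend its negation.
        loss_d = -(torch.log(D(x) + eps).mean()
                   + torch.log(1 - D(G(z)) + eps).mean())
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # 3) Generator step: descend V_G.
        loss_g = -torch.log(D(G(z)) + eps).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return D, E, G
```

The essential point of the sketch is the fixed update order of Algorithm 1: auto-encoder first, then discriminator, then generator, each with its own optimizer.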
where q_{j|i} is the conditional probability of the pairwise distance between the sample G(z_j) and the center point G(z_i), computed as follows:

q_{j|i} = (1 + ||G(z_j) − G(z_i)||^2 / 2σ_x^2)^{−1} / Σ_{k≠i} (1 + ||G(z_k) − G(z_i)||^2 / 2σ_x^2)^{−1}   (6)
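For illustration, q_{j|i} of Eq. 6 can be computed for a whole batch of generated samples at once. This sketch is ours (not the paper's code) and treats σ_x as a fixed bandwidth hyperparameter:

```python
# Our sketch of the Student-t conditional probability q_{j|i} of Eq. 6 over a
# batch of generated samples g = G(z) with shape (n, d); sigma_x is assumed
# to be a fixed bandwidth hyperparameter.
import torch

def q_conditional(g: torch.Tensor, sigma_x: float = 1.0) -> torch.Tensor:
    # Pairwise squared distances ||G(z_j) - G(z_i)||^2, shape (n, n).
    d2 = torch.cdist(g, g).pow(2)
    kernel = 1.0 / (1.0 + d2 / (2.0 * sigma_x ** 2))   # Student-t kernel
    kernel.fill_diagonal_(0.0)                          # exclude the k = i term
    # Normalize each row i over k != i to obtain q_{j|i}.
    return kernel / kernel.sum(dim=1, keepdim=True)

q = q_conditional(torch.randn(8, 2))
print(q.shape, q.sum(dim=1))   # each row sums to 1 over j != i
```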
using the scalar discriminator scores. In GAN, the discriminator differentiates real and fake samples; thus, the discriminator score D(x) can be viewed as the probability that sample x is drawn from the real data distribution. Although the exact form of D(x) is unknown, the scores D(x) at some data points x (from the training data) can be computed via the discriminator network. Therefore, we align the distributions by minimizing the difference between discriminator scores of real and generated samples. In addition, we constrain the gradients of these discriminator scores. These constraints can be derived from the Taylor approximation of the discriminator function as follows.

Assume that the first derivative of D exists, and the training set has data samples {x}. For a sample point s, by first-order Taylor expansion (TE), we can approximate D(s) with TE at a data point x:

D(s) = D(x) + ∇_x D(x)(s − x) + ε(s, x)   (9)

Figure 2: We compare our method and Dist-GAN (Tran, Bui, and Cheung 2018) on the 1D synthetic dataset of three Gaussian modes. Figures are the last frames of the demo videos (which can be found here: https://github.com/tntrung/gan). The blue curve shows the discriminator scores; the green and orange modes are the training data and the generated data respectively.
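To make Eq. 9 concrete, the following sketch (ours, with a toy discriminator standing in for the trained network) numerically checks the first-order Taylor approximation around a data point; the gap between the two printed values is the residual ε(s, x):

```python
# Our sketch: numerically check the first-order Taylor approximation of a toy
# discriminator (Eq. 9) near a data point x.
import torch

torch.manual_seed(0)
# Toy discriminator D: R^2 -> (0, 1); stands in for the trained network.
D = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1), torch.nn.Sigmoid(),
)

x = torch.randn(1, 2, requires_grad=True)   # expansion point (a data sample)
s = x.detach() + 0.01 * torch.randn(1, 2)   # nearby query point

Dx = D(x).sum()
(grad_x,) = torch.autograd.grad(Dx, x)      # grad_x D(x)

# First-order TE: D(s) ~ D(x) + grad_x D(x)(s - x); the gap is eps(s, x).
approx = Dx.item() + (grad_x * (s - x)).sum().item()
print(f"D(s) = {D(s).item():.6f}  TE approximation = {approx:.6f}")
```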
Figure 3: Examples of the number of modes (classes) and registered points of compared methods.

Table 1: Network structures for 1D synthetic data in our experiments.

                    din   dout   Nh   dh
Encoder (E)           2      2    2   64
Generator (G)         2      2    2   64
Discriminator (D)     2      1    2   64

...and the 2D experiment in the next section, we only evaluate our model with gradient matching (+GM), since we find that our new generator with gradient matching alone is already good enough; neighbors embedding is more useful for high-dimensional data samples, as will be discussed.

Synthetic 2D dataset

For 2D synthetic data, we follow the experimental setup on the same 2D synthetic dataset (Tran, Bui, and Cheung 2018). The dataset has 25 Gaussian modes in a grid layout (red points in Fig. 4) and contains 50K training points. We draw 2K generated samples for evaluating the generator. However, the performance reported in (Tran, Bui, and Cheung 2018) is nearly saturated: for example, it can recover all 25 modes and register more than 90% of the total number of points, so it is hard to see a significant improvement of our method in that setting. Therefore, we decrease the number of hidden layers and their numbers of neurons to make the task more challenging. For a fair comparison, we use equivalent encoder, generator and discriminator networks for all compared methods.

The detail of the network architecture is presented in Table 1. din = 2, dout = 2, dh = 64 are the dimensions of the input, output and hidden layers respectively; Nh is the number of hidden layers. We use ReLU for hidden layers and sigmoid for output layers. To have a fair comparison, we carefully fine-tune the other methods to ensure that they perform their best on the synthetic data. For evaluation, a mode is missed if fewer than 20 generated samples are registered to it, where registration is measured via the mode's mean and a variance of 0.01. A method has mode collapse if there are missing modes. For this experiment, the prior distribution is the 2D uniform [−1, 1]. We use the Adam optimizer with learning rate lr = 0.001 and exponential decay rate of the first moment β1 = 0.8. The parameters of our model are: λp = 0.1, λ1m = λ2m = 0.1. The learning rate is decayed every 10K steps with a base of 0.99; this decay rate avoids the learning rate saturating too quickly, which would be unfair to slowly converging methods. The mini-batch size is 128. The training stops after 500 epochs.

In this experiment, we compare our model to several state-of-the-art methods. ALI (Dumoulin et al. 2016), VAE/GAN (Larsen et al. 2015) and Dist-GAN (Tran, Bui, and Cheung 2018) are recent works using an encoder/decoder in their models. WGAN-GP (Gulrajani et al. 2017) is one of the state of the art. We also compare to VAE-based methods: VAE (Kingma and Welling 2014) and β-VAE (Higgins et al. 2016). The numbers of covered (registered) modes and registered points during training are presented in Fig. 3. The quantitative numbers of the last epochs are in Table 2, where we also report the Total Variation scores to measure the mode balance (Tran, Bui, and Cheung 2018). The result for each method is the average of eight runs. Our method outperforms all others on the number of covered modes. Although WGAN-GP and Dist-GAN are stable with larger networks and the experimental setup of (Tran, Bui, and Cheung 2018), they are less stable with our network architecture: they miss many modes and sometimes diverge. VAE-based methods often address mode collapse well, but in our experimental setup the small networks may affect the reconstruction quality, which consequently reduces their performance. Our method does not suffer serious mode collapse in any of the eight runs. Furthermore, we achieve a higher number of registered samples than all others. Our method is also better than the rest on Total Variation (TV).

In addition, we follow (Thanh-Tung, Tran, and Venkatesh 2018) to explore the gradient maps of the discriminator scores of the compared methods: standard GAN, WGAN-GP, Dist-GAN and ours, as shown in Fig. 4. This map is important because it shows the potential gradients that pull the generated samples towards the real samples (red points). The gradient map of standard GAN is noisy, uncontrolled and vanishes in many regions. The gradient map of WGAN-GP has more meaningful directions than GAN: its gradients concentrate at the centroids (red points) of the training data, and it has gradients around most of the centroids. However, WGAN-GP still has some modes where the gradients do not point towards the ground-truth centroids. Both Dist-GAN and our method show better gradients than WGAN-GP. The gradients of our method are more informative for the generator to learn, as they guide better directions towards all real ground-truths.

CIFAR-10 and STL-10 datasets

For the CIFAR-10 and STL-10 datasets, we measure the performance with FID scores (Heusel et al. 2017). FID can detect intra-class mode dropping and measures the diversity as well as the quality of generated samples. We follow the experimental procedure and model architecture in (Miyato et al. 2018) to compare methods. FID is computed from 10K real samples and 5K generated samples. Our default parameters are used for all experiments: λp = 1.0, λr = 1.0, λ1m = λ2m = 1.0. The learning rate and β1, β2 for Adam are (lr = 0.0002, β1 = 0.5, β2 = 0.9). The generator is trained with 350K updates for the logarithm loss version (Eq. 7) and 200K for the "hinge" loss version (Eq. 8) to converge better. The dimen-
Figure 4: Our 2D synthetic data has 25 Gaussian modes (red dots). The black arrows are gradient vectors of the discriminator
computed around the ground-truth modes. Figures from left to right are examples of gradient maps of GAN, WGAN-GP,
Dist-GAN and ours.
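For reference, a gradient map like those in Fig. 4 can be produced with a few lines of autograd. This is our sketch, not the paper's code; it assumes D is any trained discriminator over 2D points, and the grid range is illustrative:

```python
# Our sketch of the gradient-map computation behind Fig. 4: evaluate
# grad_x D(x) on a grid of 2D points; assumes `D` maps (N, 2) points to
# (N, 1) scores. Returns arrow positions and directions for a quiver plot.
import torch

def gradient_map(D, lo=-1.0, hi=1.0, steps=20):
    xs = torch.linspace(lo, hi, steps)
    grid = torch.cartesian_prod(xs, xs).requires_grad_(True)  # (steps^2, 2)
    scores = D(grid).sum()
    (grads,) = torch.autograd.grad(scores, grid)
    return grid.detach(), grads
```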
Table 2: Results on 2D synthetic data. Columns indicate the number of covered modes, the number of registered samples among 2000 generated samples, and two types of Total Variation (TV). We compare our model to state-of-the-art models: WGAN-GP and Dist-GAN.

Method                                   #registered modes   #registered points   TV (True)     TV (Differential)
GAN (Goodfellow et al. 2014)               14.25 ± 2.49       1013.38 ± 171.73    1.00 ± 0.00    0.90 ± 0.22
ALI (Dumoulin et al. 2016)                 17.81 ± 1.80       1281.43 ± 117.84    0.99 ± 0.01    0.72 ± 0.19
VAE/GAN (Larsen et al. 2015)               12.75 ± 3.20       1042.38 ± 170.17    1.35 ± 0.70    1.34 ± 0.98
VAE (Kingma and Welling 2014)              13.48 ± 2.31       1265.45 ± 72.47     1.81 ± 0.71    2.16 ± 0.72
β-VAE (Higgins et al. 2016)                18.00 ± 2.33       1321.17 ± 95.61     1.17 ± 0.24    1.47 ± 0.28
WGAN-GP (Gulrajani et al. 2017)            21.71 ± 1.35       1180.25 ± 158.63    0.90 ± 0.07    0.51 ± 0.06
Dist-GAN (Tran, Bui, and Cheung 2018)      20.71 ± 4.42       1188.62 ± 311.91    0.82 ± 0.19    0.43 ± 0.12
Ours                                       24.39 ± 0.44       1461.83 ± 222.86    0.57 ± 0.17    0.31 ± 0.12
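The coverage numbers in Table 2 follow the registration rule stated earlier: a mode (with known mean and variance 0.01) is covered once at least 20 generated samples register to it. Below is a sketch of that evaluation under our reading of the protocol; the exact registration radius is not stated in this excerpt, so the three-standard-deviation cutoff is our assumption:

```python
# Our sketch of the Table 2 coverage metrics: count modes with >= 20
# registered samples, and the total number of registered points. The
# registration radius (3 standard deviations, std = sqrt(0.01)) is our
# assumption; the grid span below is likewise illustrative.
import numpy as np

def coverage(samples: np.ndarray, modes: np.ndarray, var: float = 0.01,
             min_count: int = 20, n_std: float = 3.0):
    # Distance of every sample (n, 2) to every mode centre (25, 2).
    dist = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=-1)
    nearest = dist.argmin(axis=1)
    registered = dist.min(axis=1) <= n_std * np.sqrt(var)
    counts = np.bincount(nearest[registered], minlength=len(modes))
    return int((counts >= min_count).sum()), int(registered.sum())

# Example: a 5 x 5 grid of mode centres, as in Fig. 4.
modes = np.stack(np.meshgrid(np.linspace(-1, 1, 5),
                             np.linspace(-1, 1, 5)), -1).reshape(-1, 2)
n_modes, n_points = coverage(np.random.randn(2000, 2), modes)
```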
Figure 6: Generated samples of our method. The first two samples are on CIFAR-10 with CNN and ResNet architectures, and the last one is on STL-10 with CNN.
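For completeness, FID (Heusel et al. 2017) as used in this section is the Fréchet distance between Gaussian fits of Inception features of real and generated samples. A standard sketch (ours, not the paper's implementation), assuming the feature matrices have already been extracted from a pretrained Inception network:

```python
# A sketch of the FID score (Heusel et al. 2017): Frechet distance between
# Gaussian fits of Inception features; `feats_*` are assumed to be (N, 2048)
# activation matrices from a pretrained Inception network.
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(s1 @ s2, disp=False)   # matrix square root
    covmean = covmean.real                            # drop numerical imaginary part
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))
```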
References

Arjovsky, M., and Bottou, L. 2017. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862.
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. ICML.
Bellemare, M. G.; Danihelka, I.; Dabney, W.; Mohamed, S.; Lakshminarayanan, B.; Hoyer, S.; and Munos, R. 2017. The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743.
Berthelot, D.; Schumm, T.; and Metz, L. 2017. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.
Che, T.; Li, Y.; Jacob, A. P.; Bengio, Y.; and Li, W. 2016. Mode regularized generative adversarial networks. CoRR.
Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2172–2180.
Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv preprint arXiv:1605.09782.
Dumoulin, V.; Belghazi, I.; Poole, B.; Lamb, A.; Arjovsky, M.; Mastropietro, O.; and Courville, A. 2016. Adversarially learned inference. arXiv preprint arXiv:1606.00704.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672–2680.
Goodfellow, I. 2016. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160.
Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of wasserstein GANs. In Advances in Neural Information Processing Systems, 5767–5777.
Guo, Y., and Cheung, N.-M. 2018. Efficient and deep person re-identification using multi-level similarity. In CVPR.
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, 6626–6637.
Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2016. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR.
Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. ICLR.
Kodali, N.; Abernethy, J.; Hays, J.; and Kira, Z. 2017. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215.
Larsen, A. B. L.; Sønderby, S. K.; Larochelle, H.; and Winther, O. 2015. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
Lim, S. K.; Loo, Y.; Tran, N.-T.; Cheung, N.-M.; Roig, G.; and Elovici, Y. 2018. DOPING: Generative data augmentation for unsupervised anomaly detection. In Proceedings of the IEEE International Conference on Data Mining (ICDM).
Liu, K. 2018. Varying k-Lipschitz constraint for generative adversarial networks. arXiv preprint arXiv:1803.06107.
Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
Makhzani, A.; Shlens, J.; Jaitly, N.; and Goodfellow, I. 2016. Adversarial autoencoders. In International Conference on Learning Representations.
Mescheder, L.; Geiger, A.; and Nowozin, S. 2018. Which training methods for GANs do actually converge? In International Conference on Machine Learning, 3478–3487.
Mescheder, L.; Nowozin, S.; and Geiger, A. 2017. The numerics of GANs. In Advances in Neural Information Processing Systems, 1825–1835.
Metz, L.; Poole, B.; Pfau, D.; and Sohl-Dickstein, J. 2017. Unrolled generative adversarial networks. ICLR.
Miyato, T.; Kataoka, T.; Koyama, M.; and Yoshida, Y. 2018. Spectral normalization for generative adversarial networks. ICLR.
Nagarajan, V., and Kolter, J. Z. 2017. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, 5585–5595.
Nowozin, S.; Cseke, B.; and Tomioka, R. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS.
Park, N.; Anand, A.; Moniz, J. R. A.; Lee, K.; Chakraborty, T.; Choo, J.; Park, H.; and Kim, Y. 2018. MMGAN: Manifold matching generative adversarial network for generating images. CoRR abs/1707.08273.
Petzka, H.; Fischer, A.; and Lukovnicov, D. 2017. On the regularization of wasserstein GANs. arXiv preprint arXiv:1709.08894.
Roth, K.; Lucchi, A.; Nowozin, S.; and Hofmann, T. 2017. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, 2018–2028.
Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training GANs. In NIPS, 2234–2242.
Thanh-Tung, H.; Tran, T.; and Venkatesh, S. 2018. On catastrophic forgetting and mode collapse in generative adversarial networks. In Workshop on Theoretical Foundation and Applications of Deep Generative Models.
Tran, N.-T.; Bui, T.-A.; and Cheung, N.-M. 2018. Dist-GAN: An improved GAN using distance constraints. In ECCV.
Yazıcı, Y.; Foo, C.-S.; Winkler, S.; Yap, K.-H.; Piliouras, G.; and Chandrasekhar, V. 2018. The unusual effectiveness of averaging in GAN training. arXiv preprint arXiv:1806.04498.
Zhao, J.; Mathieu, M.; and LeCun, Y. 2017. Energy-based generative adversarial network. ICLR.