Visual Data Synthesis Via GAN For Zero-Shot Video Classification
Abstract

Zero-Shot Learning (ZSL) in video classification is a promising research direction, which aims to tackle the challenge posed by the explosive growth of video categories. Most existing methods exploit seen-to-unseen correlation via learning a projection between visual and semantic spaces. However, such projection-based paradigms cannot fully utilize the discriminative information implied in the data distribution, and commonly suffer from the information degradation issue caused by the "heterogeneity gap". In this paper, we propose a visual data synthesis framework via GAN to address these problems. Specifically, both semantic knowledge and the visual distribution are leveraged to synthesize video features of unseen categories, so that ZSL can be turned into a typical supervised problem with the synthetic features. First, we propose multi-level semantic inference to boost video feature synthesis, which captures the discriminative information implied in the joint visual-semantic distribution via feature-level and label-level semantic inference. Second, we propose Matching-aware Mutual Information Correlation to overcome the information degradation issue, which captures seen-to-unseen correlation in matched and mismatched visual-semantic pairs by mutual information, providing the zero-shot synthesis procedure with robust guidance signals. Experimental results on four video datasets demonstrate that our approach can improve zero-shot video classification performance significantly.

1 Introduction

In the past decade, supervised video classification methods have achieved significant progress due to deep learning techniques and large-scale labeled datasets. However, with the number of video categories growing rapidly, classic supervised frameworks suffer from the following challenges: (1) they rely heavily on large-scale labeled data, while collecting labels for video is laborious and costly, and video instances of some categories are quite rare; (2) models learned on limited categories are hard to expand automatically as video categories grow.

To overcome such restrictions, Zero-Shot Learning (ZSL) has been studied widely [Zhang and Saligrama, 2015; Changpinyo et al., 2016; Xu et al., 2017; Zhang et al., 2017], which aims to construct classifiers dynamically for novel categories. Instead of treating each category independently, ZSL adopts semantic representations (e.g., attributes or word vectors) as side information to associate categories in the source and target domains for semantic knowledge transfer. Existing zero-shot classification methods focus primarily on still images. In this paper, we focus on zero-shot classification in videos, which is increasingly needed as novel video categories emerge on the web every day. Furthermore, compared to images, zero-shot video classification has its own characteristics and is thus more challenging. First, video data contains more noise than images, setting a higher requirement on the robustness of zero-shot classification models. Second, video features describe both spatial and temporal information, so their manifold is more complex. Third, video context, with its various poses and appearances, is highly changeable, so the video distribution is much more long-tailed than that of images.

Most existing ZSL methods learn a projection among different representation spaces (i.e., visual, semantic, or intermediate) based on data from the seen domain, and the learned projection is applied directly to the unseen domain during testing. However, such projection-based methods have several shortcomings and are thus not robust enough for video ZSL. On the one hand, they exploit seen-to-unseen correlation only from the semantic knowledge aspect, ignoring the discriminative information implied in the visual data distribution. In fact, the intrinsic visual distribution plays a vital role in zero-shot video classification, since the most discriminative information is derived from the visual feature space [Long et al., 2017]. On the other hand, the inconsistency between visual features and semantic representations causes a heterogeneity gap, and the projection between heterogeneous spaces leads to an information loss such that the seen-to-unseen correlation degrades. A specific effect caused by this gap is the hubness problem [Radovanović et al., 2010], where some irrelevant category prototypes become near neighbors of each other, as the projection is performed in high-dimensional spaces.
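The hubness effect mentioned above can be observed even on synthetic data. The following toy numpy experiment (an illustration, not part of the paper's method) counts each point's "k-occurrence", i.e., how often it appears in the k-nearest-neighbor lists of the other points; in high dimensions this distribution typically becomes skewed, with a few hub points dominating.

```python
import numpy as np

def k_occurrence(points, k=5):
    """For each point, count how often it appears in the k-NN lists of
    the other points (its k-occurrence). Large maxima indicate hubs."""
    n = len(points)
    # Pairwise squared Euclidean distances via the Gram-matrix identity.
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # exclude self-neighbors
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:
            counts[j] += 1
    return counts

rng = np.random.default_rng(0)
low = k_occurrence(rng.normal(size=(300, 3)))     # 3-D Gaussian cloud
high = k_occurrence(rng.normal(size=(300, 500)))  # 500-D Gaussian cloud

# In high dimensions the k-occurrence distribution tends to be far more
# skewed: a few "hub" points are neighbors of many others.
print(low.max(), high.max())
```

Every point contributes exactly k neighbor votes, so both count vectors sum to n·k; only their concentration differs with dimensionality.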
∗ Corresponding author.

In this paper, we introduce Generative Adversarial Networks (GANs) [Goodfellow et al., 2014] to realize zero-shot video classification from the perspective of generation, which bypasses the above limitations of explicit projection learning. The key idea is to model the joint distribution over high-level video features and semantic knowledge via adversarial learning, where discriminative information and seen-to-unseen correlation are embedded effectively for novel video feature synthesis. Once the training procedure is done, unseen video features can be synthesized by the generator, and zero-shot classification is realized in the video feature space in a supervised fashion. Namely, the synthetic video features are utilized to train a conventional supervised classifier (e.g., an SVM), or serve directly in the simplest nearest-neighbor algorithm.

However, synthesizing discriminative video features from semantic knowledge is non-trivial for prevalent adversarial learning frameworks such as the conditional GAN (cGAN) [Mirza and Osindero, 2014]. The main challenges we encountered are two-fold: (1) how to model the joint distribution over video features and semantic knowledge robustly and ensure the discriminative characteristics of the synthetic features; (2) how to mitigate the impact of the heterogeneity gap and transfer semantic knowledge maximally.

To tackle the first challenge, we propose a multi-level semantic inference approach, aiming to fully utilize the distribution over video features and semantic knowledge for discriminative information mining. It contains two opposite synthesis procedures driven by adversarial learning: the semantic-to-visual branch synthesizes video features given semantic knowledge, and the visual-to-semantic branch inversely infers the semantic knowledge at both the feature level and the label level. Such two-pronged semantic inference forces the generator to capture the discriminative attributes for visual-semantic alignment, and the bidirectional synthesis procedures boost each other collaboratively to ensure the robustness of the synthetic video features.

To tackle the second challenge, we propose matching-aware mutual information correlation, which provides the synthesis procedure with informative guidance signals for overcoming the information degradation issue. Instead of direct feature projection, mutual information hints among matched and mismatched visual-semantic pairs are utilized for semantic knowledge transfer. Therefore, the statistical dependence among heterogeneous representations can be captured, bypassing the information degradation issue of typical projection-based ZSL methods.

To verify the effectiveness of the proposal, we conduct extensive experiments on four widely-used video datasets, and the experimental results demonstrate that our approach improves zero-shot video classification performance significantly.

2 Related Work

2.1 Zero-shot Learning

Zero-shot learning has drawn wide attention due to its potential to scale to large numbers of novel categories in classification tasks. Early explorations in ZSL mainly focused on learning probabilistic attribute classifiers (PAC), such as Direct Attribute Prediction (DAP) [Lampert et al., 2009], Indirect Attribute Prediction (IAP) [Lampert et al., 2014] and CONSE [Norouzi et al., 2014]. In these approaches, the posterior over seen categories is predicted first, and the attribute classifiers are then learned by maximizing the posterior estimate. PAC has been shown to perform poorly in ZSL tasks, as it ignores the relations among different attributes [Jayaraman et al., 2014]. After that, research efforts shifted to another direction that directly builds a projection from the visual feature space to the semantic (e.g., attribute or word-vector) space. In this paradigm, linear [Palatucci et al., 2009], bilinear [Romera-Paredes and Torr, 2015; Zhang and Saligrama, 2016] and nonlinear [Socher et al., 2013; Zhang et al., 2017] compatibility models have been explored widely by optimizing specific loss functions. Besides the visual → semantic mapping, two other mapping directions have been studied, namely the semantic → visual mapping and embedding both visual and semantic features into another shared space [Zhang and Saligrama, 2015]. Nowadays, various hybrid models [Akata et al., 2016; Changpinyo et al., 2016] are also studied based on such diverse choices of embedding space.

Recently, Unseen Visual Data Synthesis (UVDS) [Long et al., 2017] has viewed ZSL from a new perspective: it tries to synthesize visual features of unseen instances so as to convert ZSL into a typical supervised problem. However, it is still an explicit projection learning framework in which visual features are synthesized through embedding-based matrix mapping. On the one hand, such an explicit projection paradigm can hardly preserve the manifold information of the visual feature space or capture the unseen-to-seen correlation needed for zero-shot generalization. On the other hand, the visual features synthesized by UVDS suffer from a variance decay issue [Long et al., 2017], which is not robust enough for high-dimensional video feature synthesis. In this paper, we address these problems with deep generative models.

2.2 Generative Adversarial Networks

As one of the most promising generative models, Generative Adversarial Networks (GANs) [Goodfellow et al., 2014] have been studied extensively in recent years. The goal of a GAN is to learn a generator distribution P_g(x) that matches the real data distribution P_real(x) through a minimax game:

    min_G max_D V(G, D) = E_{x∼P_real}[log D(x)] + E_{z∼P_noise}[log(1 − D(G(z)))]

The basic GAN is unrestricted and uncontrollable during training, and the prior noise z is difficult to interpret. To tackle these defects, a surge of GAN variants has emerged, such as the conditional GAN (cGAN) [Mirza and Osindero, 2014], InfoGAN [Chen et al., 2016] and the Wasserstein GAN (W-GAN) [Arjovsky et al., 2017]. cGAN tries to solve the controllability issue by providing class labels to both the generator and the discriminator as a condition. InfoGAN aims to learn more interpretable representations by maximizing the mutual information between a latent code and the generator's output. W-GAN introduces the earth mover's distance to measure the similarity of the real and fake distributions, which makes the training stage more stable. Our work is closely related to cGAN and InfoGAN, and details are discussed in the sequel.
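For intuition, the minimax value function V(G, D) above can be evaluated directly on toy data. The sketch below uses a fixed linear "generator" and a logistic "discriminator" on 1-D samples; these stand-ins are illustrative assumptions, not the networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D stand-ins for P_real and P_noise.
x_real = rng.normal(loc=2.0, scale=1.0, size=1000)
z = rng.normal(size=1000)

def generator(z, w=1.0, b=0.0):
    # A linear "generator" G(z) = w*z + b, purely for illustration.
    return w * z + b

def discriminator(x, a=1.0, c=-2.0):
    # A logistic "discriminator" D(x) = sigmoid(a*x + c).
    return 1.0 / (1.0 + np.exp(-(a * x + c)))

def value_function(x_real, z):
    # V(G, D) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
    eps = 1e-12  # numerical safety for the logarithms
    term_real = np.log(discriminator(x_real) + eps).mean()
    term_fake = np.log(1.0 - discriminator(generator(z)) + eps).mean()
    return term_real + term_fake

print(value_function(x_real, z))
```

Since D outputs probabilities strictly inside (0, 1), both expectation terms are negative; training alternates D ascending and G descending on this quantity.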
[Figure 1 diagram: the components include the Generator (G), Inference model (R), Correlation model (Q), Discriminator (D) and a Feature Extractor, with noise z ∼ N(0, 1) and the Semantic Space as inputs; the training signals are a feature-level inference loss, a correlation loss and the original D loss.]

Figure 1: Architecture of the proposed framework. A group of noise vectors is utilized by the generator to synthesize the video feature V_g, which is used by the inference model R and the discriminator D simultaneously to perform semantic inference and the correlation constraint. V_r denotes the real visual feature; e_mat and e_mis denote the matched and mismatched semantic embedding sets corresponding to V_r, respectively.
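The data flow in Figure 1 can be sketched with randomly initialized linear maps. The dimensions below follow the paper where stated (noise of size 100, semantic embeddings of size 300); the visual feature size and the linear, single-layer "networks" are assumptions made for this sketch only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions: noise z, semantic embedding e, visual feature v.
# V_DIM is an arbitrary choice for illustration.
Z_DIM, E_DIM, V_DIM = 100, 300, 512

# Toy linear "networks": generator G, inference model R, discriminator D.
W_g = rng.normal(scale=0.01, size=(Z_DIM + E_DIM, V_DIM))
W_r = rng.normal(scale=0.01, size=(V_DIM, E_DIM))
W_d = rng.normal(scale=0.01, size=(V_DIM + E_DIM, 1))

def G(z, e):
    # Semantic-to-visual branch: synthesize a video feature from [z, e].
    return np.tanh(np.concatenate([z, e]) @ W_g)

def R(v):
    # Visual-to-semantic branch: infer the semantic embedding back from v.
    return v @ W_r

def D(v, e):
    # Matching-aware discriminator scores a (visual, semantic) pair.
    logit = np.concatenate([v, e]) @ W_d
    return 1.0 / (1.0 + np.exp(-logit))

z = rng.normal(size=Z_DIM)      # z ~ N(0, 1)
e_mat = rng.normal(size=E_DIM)  # matched semantic embedding
v_g = G(z, e_mat)               # synthetic video feature V_g
e_hat = R(v_g)                  # feature-level semantic inference
score = D(v_g, e_mat)           # real/fake score for the matched pair
```

In the full framework the same discriminator also sees (V_r, e_mat) and (V_r, e_mis) pairs, which is what makes the adversarial signal matching-aware.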
3.5 Zero-shot Classification

At the test stage, we simply use nearest-neighbor (NN) search and an SVM as classifiers for evaluating the discriminative capability of the synthetic video features. For NN search, the label of a test instance is predicted by:

    ŷ = arg min_{y ∈ y_te, v* ∈ V_te} ||G(z, g(y)) − v*||    (13)

where G(z, g(y)) is the synthetic visual feature of label y, v* is the original visual feature in the test set V_te, and ŷ is the predicted label.

For experiments with the SVM, we synthesize visual features in the same amount as the unseen categories in the test set. Then the synthetic features of the unseen categories and the original visual features of the seen categories are merged to train an SVM with a 3rd-degree polynomial kernel, whose slack parameter is set to 5.

4 Experiments

4.1 Datasets and Settings

Datasets: Experiments are conducted on four popular video classification datasets: HMDB51 [Kuehne et al., 2013], UCF101 [Soomro et al., 2012], Olympic Sports [Niebles et al., 2010] and Columbia Consumer Video (CCV) [Jiang et al., 2011], which respectively contain 6.7k, 13k, 783 and 9.3k videos with 51, 101, 16 and 20 categories.

Zero-Shot Settings: There are two ZSL settings, namely the strict setting and the generalized setting [Xian et al., 2017]. The former assumes the absence of seen classes at test time, while the latter takes both seen and unseen data as test data during testing. In this paper, we adopt the strict setting for both the NN search and SVM experiments.

Data Split: There are quite few zero-shot learning evaluation protocols for video classification in the community. Xu et al. [Xu et al., 2017] established a baseline for this field. In order to compare our framework with the state of the art, we follow the data splits proposed by [Xu et al., 2017]: 50/50 for every dataset, i.e., the visual features of 50% of the categories are used for model training, and the other 50% of the categories are held unseen until test time. We take the average accuracy and standard deviation as evaluation metrics and report the results over 50 independent splits generated randomly.

4.2 Implementation Details

Our model is implemented with PyTorch¹. Both traditional features and deep features are investigated as video features. For the former, similar to [Xu et al., 2017], we extract improved trajectory features (ITF) with three descriptors (HOG, HOF and MBH) for each video, then encode them by Fisher Vectors (FV), obtaining a combined visual feature with 50688 dimensions. For the latter, we use the two-stream framework [Simonyan and Zisserman, 2014] based on VGG-19 to extract spatial-temporal features of videos, and the frame and optical-flow features from the last pooling layer are concatenated to form the final visual embedding. We adopt GloVe [Pennington et al., 2014], trained on Wikipedia with more than 2.2 million unique vocabulary items, to obtain semantic embeddings of dimension 300. The dimension of the Gaussian noise is 100 and the cardinality of the noise set is set to 30. We train our framework for 300 epochs using the Adam optimizer with momentum 0.9. We initialize the learning rate to 0.01 and decay it every 50 epochs by a factor of 0.5. Both λ1 and λ2 are set to 1.

¹ http://pytorch.org/

4.3 Compared Methods

We compare our framework with several ZSL methods for video: (1) Convex Combination of Semantic Embeddings (CONSE) [Norouzi et al., 2014]. CONSE is a posterior-based model which trains classifiers on seen categories; the prediction models are then built on a linear combination of the existing posteriors. (2) Structured Joint Embedding (SJE) [Akata et al., 2015]. SJE optimizes the structural SVM loss to learn a bilinear compatibility, utilizing bilinear ranking to maximize the score among relevant labels and minimize the score among irrelevant labels. (3) Manifold regularized ridge regression (MR) [Xu et al., 2017]. MR enhances the conventional projection pipeline with manifold regularization, self-training and data augmentation in a transductive manner.

4.4 Experimental Results

The experimental results are shown in Table 1. Note that FV denotes the Fisher Vector-encoded dense trajectory features and DF denotes the deep video features extracted by two-stream neural networks. From the results we draw several conclusions: (1) All methods are far beyond the random-guess bound, demonstrating the success of ZSL in video classification. (2) Our approach beats most existing methods in terms of average accuracy/mAP, while its standard deviation is slightly higher. This fact illustrates that the generative framework is not as stable as projection-based methods such as SJE and MR. (3) Compared to the SVM, NN search suffers more from the hubness problem and thus achieves suboptimal results. However, NN search based on our approach performs better than CONSE and SJE at no extra cost, indicating the discriminative power of the synthetic features generated by our approach. (4) Our approach improves zero-shot video classification performance significantly on both traditional visual features and deep video features, indicating that our proposal is robust enough to capture the main discriminative information of different visual features.

4.5 Ablation Studies

We conduct ablation studies to further evaluate the effect of the two components of our proposal, and the results are exhibited in Table 2. For the sake of simplicity, we only report the results on UCF101 and CCV, since similar results are yielded on the other two datasets. From Table 2, we can draw the following observations: (1) With the aid of the semantic condition, the synthetic video features exceed the real features, suggest-
Table 1: Experimental results of our approach and comparison to the state of the art for ZSL video classification on four datasets. Average % accuracy ± standard deviation for HMDB51 and UCF101; mean average precision ± standard deviation for Olympic Sports and CCV. All experiments use the same word vectors as semantic embeddings for fair comparison. Random guess provides the lower bound for each dataset. FV and DF denote the traditional and deep video features, respectively.
Figure 2: Visualization of the traditional visual feature distribution of 8 random unseen classes on Olympic Sports dataset. Different classes
are shown in different colors.
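As a closing illustration, the nearest-neighbor rule of Eq. (13) in Section 3.5 amounts to labeling each test feature with the class whose synthesized feature lies closest. A minimal numpy sketch follows; the synthetic features here are toy stand-ins for the output of the trained generator G, which is not reproduced in this sketch.

```python
import numpy as np

def nn_classify(test_features, synth_features, synth_labels):
    """Eq. (13): assign each test feature v* the label y whose synthetic
    feature G(z, g(y)) minimizes ||G(z, g(y)) - v*||."""
    preds = []
    for v in test_features:
        dists = np.linalg.norm(synth_features - v, axis=1)
        preds.append(synth_labels[int(np.argmin(dists))])
    return np.array(preds)

# Toy example: two unseen classes with well-separated synthetic prototypes.
synth_features = np.array([[0.0, 0.0], [10.0, 10.0]])
synth_labels = np.array([0, 1])
test_features = np.array([[0.5, -0.2], [9.0, 11.0]])
print(nn_classify(test_features, synth_features, synth_labels))  # → [0 1]
```

With multiple synthetic features per class (one per noise vector, as in the paper's noise set of cardinality 30), the same rule applies with repeated labels in `synth_labels`.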