Janwe 2017
DOI 10.1007/s10489-017-1033-x
Abstract Describing visual contents in videos by semantic concepts is an effective and realistic approach that can be used in video applications such as annotation, indexing, retrieval and ranking. In these applications, video data needs to be labelled with some known set of labels or concepts. Assigning semantic concepts manually is not feasible due to the large volume of ever-growing video data. Hence, automatic semantic concept detection in videos is an active research area. Recently, deep Convolutional Neural Networks (CNNs) used in computer vision tasks have shown remarkable performance. In this paper, we present a novel approach for automatic semantic video concept detection using a deep CNN and a foreground driven concept co-occurrence matrix (FDCCM), which keeps foreground-to-background concept co-occurrence values and is built by exploiting concept co-occurrence relationships in the pre-labelled TRECVID video dataset and in a collection of random images extracted from Google Images. To deal with the dataset imbalance problem, we have extended this approach by fusing two asymmetrically trained deep CNNs and used the FDCCM to further improve concept detection. The performance of the proposed approach is compared with state-of-the-art approaches for video concept detection over the widely used TRECVID dataset and is found to be superior to existing approaches.

Keywords Semantic video concept detection · Foreground driven concept co-occurrence matrix · Convolutional neural network · Deep learning · Multi-label classification · Asymmetric training

Abbreviations
CCM concept co-occurrence matrix
FDCCM foreground driven concept co-occurrence matrix
CNN convolutional neural network

Nitin J. Janwe (nitinj janwe@yahoo.com)
Kishor K. Bhoyar (kkbhoyar@yahoo.com)
Department of Information Technology, YCCE, Nagpur, India

1 Introduction

Due to the advancement of networking and multimedia technologies, the use of multimedia data has increased exponentially, and video indexing and retrieval is seen as one of the most challenging issues in the area. Users always desire to retrieve videos on the basis of semantic objects like car, road, airplane etc. Therefore, it is more realistic and effective for the end user to express a query in terms of semantic objects or concepts. Such queries are meaningless unless the video data has been labelled with some known set of labels or concepts. It is not feasible to manually label semantic concepts considering the huge volume of video collections. Therefore, detecting semantic concepts automatically for video samples is a promising research area and a key step in many video based applications such as annotation, indexing and retrieval. There is a growing need for automatically detecting concepts from low-level visual properties by learning the correspondence from loosely labeled data. The semantic gap [15] is the most challenging problem in concept based video retrieval, which is a gap
between the low-level representation of video and the higher-level semantic concepts which a user associates with it.

In implementing a system for semantic video concept detection, the problems reported in the literature are: 1) Most current methods based on statistical learning techniques [16, 17] use low-level features and struggle to bridge the semantic gap, while the methods based on deep convolutional neural networks did not exploit fusion and asymmetric training of CNN classifiers to deal with the imbalanced dataset problem and to improve concept detection. 2) Many methods used the context relationship of concepts to improve performance, but did not focus on improving performance by exploiting the nature of the concepts.

There are two main contributions of this paper. Firstly, we have used a fusion of deep Convolutional Neural Networks (CNNs) for building a classifier and applied the idea of asymmetric training to deal with the dataset imbalance problem and to improve concept detection performance. Secondly, we have built a novel foreground driven concept co-occurrence matrix (FDCCM) by exploiting the foreground nature of concepts to predict background concepts. It is built using a local visual co-occurrence measure from the local training dataset and a global visual co-occurrence measure from a collection of random images retrieved from Google Images. The FDCCM is used to refine concept prediction scores to improve concept detection.

The remainder of the paper is organized as follows: Section 2 summarizes the related work. Section 3 describes the various steps of semantic video concept detection. Section 4 presents the details of the proposed method. Section 5 gives experimental results and performance evaluation and, lastly, Section 6 concludes the paper.

2 Related work

A considerable amount of research has been reported in the literature to address video processing applications such as semantic video concept detection and video retrieval. This section provides a brief review of the related work in the area of semantic video concept detection.

2.1 Temporal-events and object actions based models

In order to detect temporal events and object actions, a group of methods have been proposed that exploit temporal information from video sequences. These approaches can be roughly classified into those classifying global video sequences into generic semantic categories (such as "wedding", "rally") and those recognizing object-level actions (e.g. "people walking", "train running").

2.1.1 Object-level action recognition

Robust detection and recognition of object-level actions is very useful for video content description, since it provides unique object-level dynamic information. Bobick and Davis [18, 19] derived the temporal template representation from background-subtracted images; a variety of choreographed actions across different subjects and views can be handled. Zelnik and Irani [20] use marginal histograms of spatio-temporal gradients at several temporal scales to cluster and recognize video events.

2.1.2 Generic concept detection by classifying video sequences

Recognizing generic concepts in general domains, such as broadcast news videos, consumer videos etc., is a more challenging problem. In such cases, there are often fast-moving small objects, large camera motion, articulated motions, significant clutter and object occlusion, and the event of interest may involve high-level semantics. Dong and Chang [21] use kernel-based discriminative classification to detect a large variety of generic events in news videos. Zhou et al. [22] propose a SIFT-Bag based framework where each video clip is encoded as a bag of SIFT feature vectors [43], and the distribution of each video clip is modeled by Gaussian Mixture Models (GMMs).

2.2 Static key-frame based

In the second approach, video streams are segmented into shots and the key-frames which best represent each shot are identified. Statistical learning models such as the support vector machine and the neural network are used for training classifiers: low-level features are extracted from key-frames and classifiers are trained on them. There are three approaches to training: 1) supervised training, where the concepts are fixed or known; 2) unsupervised training, where the concepts are not known; and 3) semi-supervised training, where some concepts are known. Zha et al. [35] present a video concept detection system using the support vector machine as a supervised machine learning tool. Janwe and Bhoyar [7] implemented video concept detection using a supervised neural network model. Memar and Affendey [13] proposed an unsupervised concept-based video retrieval system based on the integration of knowledge-based and corpus-based semantic word similarity measures, in order to retrieve video shots for concepts whose annotations are not available to the system.

2.2.1 Low-level feature-based classifier

Videos comprise information from both visual and audio modalities. Each modality brings some information
Multi-label semantic concept detection in videos using fusion of asymmetrically trained deep...
complementary with the other, and their joint processing can help uncover relationships that are otherwise unavailable. Besides some conventional low-level visual features from the color, texture, shape and edge categories, advanced features such as the scale invariant feature transform (SIFT) and bag of features (BOF) [43] can also be used in video concept detection. All of the methods discussed above are based on low-level features.

2.2.2 Deep feature-based classifier

In recent years, because of tremendous increases in computing power, more powerful statistical learning models like deep convolutional neural networks have emerged. These models have drastically improved the robustness of computer vision systems. CNNs have shown better scaling properties than conventional machine learning methods such as SVM, principal component analysis and linear discriminant analysis. Generic descriptors extracted from deep CNNs have proven to be very powerful [11, 12, 14]. The CNN has replaced traditional image representations built on hand-designed features and yields deep hierarchies of features. Recently, researchers have been moving more and more towards deep CNNs [23, 24] because (1) they support large volumes of training data and (2) high-end computing devices such as large arrays of CPU cores [25] and GPUs are now available. Krizhevsky et al. [23] showed that excellent recognition accuracy can be achieved by training a large CNN [24] on a large dataset with the standard back-propagation [26] algorithm. Chien-Hao et al. [2] implemented high-level image feature learning by a deep convolutional neural network in an image retrieval system. Similarly, Podlesnaya and Podlesnyy [3] built a video indexing and retrieval system based on features extracted by a convolutional neural network. McCormac et al. [4] discussed the use of a CNN to implement intuitive user interaction by combining the CNN with a state-of-the-art dense Simultaneous Localization and Mapping (SLAM) system.

2.2.3 Combining low-level and deep features

Kikuchi et al. [5] presented a new feature extraction method in which a combination of low-level features (e.g., SIFT and HOG) and convolutional neural network (CNN) derived features is used to represent the meaningful objects that contribute to the determination of semantic concepts, to achieve accurate and robust semantic indexing of videos.

In the literature on semantic video concept detection, indexing and retrieval systems, TRECVID [47] contributes a lot to the ongoing research in the field. Awad et al. [6] present a detailed review of the work presented and run at TRECVID for the semantic indexing task from 2010 to 2015 for content-based access to video documents and collections.

2.3 Multi-label based models

The approaches in the literature on video concept detection are categorized into single-label and multi-label methods. In the first type, a sample in the dataset is assigned one label (concept), whereas in multi-label or multi-concept methods, multiple concepts can be associated with a data sample. In multi-label methods, a classifier is trained using either a binary classification approach or multi-class classification. Wang et al. [34] discussed a transductive multi-label learning approach for video concept detection. Li [10] discusses implementing multi-label image classification with a probabilistic label enhancement model by constructing auxiliary labels.

2.4 Visual co-occurrence based models

Visual co-occurrence assists in detecting concepts, unlike other conceptual and perceptual models [27] such as the WordNet distance [28]. It has been shown [29] that concept co-occurrence strengthens the appearance of concepts in video. The co-occurrence models in the literature are mainly image based. The approaches used for concept detection in these models have gained increasing popularity [30–32]. Feng and Bhanu [33] measured the relatedness of semantic and visual concepts in the image and built concept co-occurrence patterns. In [28, 29], pairwise concept co-occurrence has been integrated into the concept categorization framework by using a co-occurrence matrix. Modiri et al. [9] proposed a contextual approach to complex video classification based on the generalized maximum clique problem, which uses the co-occurrence of concepts as the context model. In [1], Feng et al. used concept co-occurrence patterns and concept signatures for image annotation and retrieval applications. These approaches have several advantages over standard concept inference techniques; for example, incorporating semantic context compensates for the ambiguity of concept visual appearance. However, the co-occurrence matrix imposes an inevitable pairwise constraint on the relationship.

Several recent works explore multi-concept learning/detection techniques for automated image annotation that aim to model the co-occurrence information among concepts/annotations. Very few works have given attention to the possible coding of concept tags in concept detection methods. Chen et al. [40] proposed a method for the coding of tags based on binary numbers and transformed Chinese characters.

3 Semantic video concept detection

The goal of semantic video concept detection is to detect the semantic concepts of a video segment based on its visual
appearance. The pipeline of a complete semantic video concept detection system is shown in Fig. 1 and consists of four steps, namely: (1) shot boundary detection or shot segmentation, (2) key-frame extraction, (3) classifier training, and (4) concept detection.

The input to the system is a video stream. The details of the above four steps are described below.

3.1 Shot segmentation

In order to detect semantic concepts precisely in a video, the video needs to be segmented into shots. A shot exhibits strong content correlations between frames; hence shots are considered to be the basic units in concept detection. Generally, shot boundaries are of two types: a cut, where the transition between two consecutive shots is abrupt, and a gradual transition, where the boundary is stretched over multiple frames; examples are dissolve, fade-in and fade-out. Shot boundary detection methods usually extract visual features from each frame, measure the similarities between frames, and detect shot boundaries between frames that are dissimilar.

3.2 Key-frame extraction

In video processing applications, shots are often represented by a single frame, called a key-frame, which is supposed to be representative of the shot. There are great similarities among the frames of the same shot; therefore, the frames that best reflect the shot contents are chosen as key-frames. Mostly, the middle frame of a shot is taken as the key-frame, on the assumption that the middle segment contains the key contents, but many other techniques exist. A shot need not always be represented by a single frame; in cases of greater visual complexity, multiple key-frames are required. For such cases, approaches like unsupervised clustering can be used, where the frames in a shot are clustered depending on the variation in shot content and the frame closest to each cluster center is chosen as a key-frame. Each cluster is represented by a unique key-frame, so a single shot can have multiple key-frames. The choice of key-frame may also depend on the object or event one is interested in; the frame that best represents that object or event can then be chosen as the key-frame.

3.3 Training a CNN classifier

Automatic multi-label semantic concept detection for videos is a supervised machine learning problem, where we provide a set of extracted key-frames of video shots as input and the corresponding concepts as output. The multi-label video concept detection task is posed as a binary classification problem. A CNN is used as a baseline.

3.3.1 Overview of CNN

A neural network is a function g mapping data x, for example an image, to an output vector y, for example an image label. The function is the composition of a sequence of simpler functions f_l, which are called computational blocks or layers. Let x_1, x_2, ..., x_L be the outputs of the layers in the network, and let x_0 = x denote the network input. Each intermediate output x_l = f_l(x_{l-1}; w_l) is computed from the previous output x_{l-1} by applying the function f_l with parameters w_l.

In a CNN, the data has a spatial structure: each x_l ∈ R^{H_l × W_l × C_l} is a 3D array or tensor, where the first two dimensions H_l (height) and W_l (width) are interpreted as spatial dimensions and the third dimension C_l as the number of feature channels. Hence, the tensor x_l represents an H_l × W_l field of C_l-dimensional feature vectors, one for each spatial location. A fourth dimension N_l of the tensor spans multiple data samples packed in a single batch for efficient parallel processing; the number of data samples N_l in a batch is called the batch cardinality. The network is called convolutional because the functions f_l are local and translation-invariant operators (i.e. non-linear filters), like linear convolution. CNNs are often used as classifiers or regressors. The output ŷ = f(x) is a vector of probabilities, one for each of N possible image labels (dog, cat, trilobite, etc.). If y is the true label of image x, we can measure the CNN performance by a loss function l_y(ŷ) ∈ R which assigns a penalty to classification errors. The CNN parameters can then be tuned, or learned, to minimize this loss averaged over a large dataset of labelled example images. Learning a CNN requires computing the derivative of the loss with respect to the network parameters; the derivatives are computed using an algorithm called back-propagation. CNNs are inherently translation invariant: their basic components (convolution, pooling, and activation functions) operate on local input regions and depend only on relative spatial coordinates. The details of the most important CNN operations are as follows:

(1) Convolution: The convolutional block is implemented by the function y = convolution(x, f, b), which computes the convolution of the input map x with a bank of K
multi-dimensional filters f and biases b. Here, x ∈ R^{H×W×D}, f ∈ R^{H'×W'×D×K}, y ∈ R^{H''×W''×K}. The output of convolving the signal, for a 1D slice, is given by (1):

y_{i''j''k} = b_k + Σ_{i'=1}^{H'} Σ_{j'=1}^{W'} Σ_{d'=1}^{D} f_{i'j'd'k} · x_{i''+i'-1, j''+j'-1, d'}    (1)

(2) Padding and Stride: The convolution function allows specifying top-bottom-left-right paddings (P_h^-, P_h^+, P_w^-, P_w^+) of the input array and subsampling strides (S_h, S_w) of the output array:

y_{i''j''k} = b_k + Σ_{i'=1}^{H'} Σ_{j'=1}^{W'} Σ_{d'=1}^{D} f_{i'j'd'k} · x_{S_h(i''-1)+i'-P_h^-, S_w(j''-1)+j'-P_w^-, d'}    (2)

(3) Spatial Pooling: It is of two types, max and sum pooling.

• Max Pooling: Computes the maximum response of each feature channel in an H' × W' patch by (3). Note that the input x and the output y have the same number of feature channels.

• Spatial Normalization: The spatial normalization operator acts on different feature channels independently and rescales each input feature by the energy of the features in a local neighborhood. First, the energy of the features in a W' × H' neighborhood is evaluated by (8):

n²_{i''j''d} = (1 / (W'H')) Σ_{1≤i'≤H', 1≤j'≤W'} x²_{i''+i'-1-⌊(H'-1)/2⌋, j''+j'-1-⌊(W'-1)/2⌋, d}    (8)

(6) Softmax: Computes the softmax operator:

y_{ijk} = e^{x_{ijk}} / Σ_{t=1}^{D} e^{x_{ijt}}    (9)

Note that the operator is applied across feature channels, in a convolutional manner, at all spatial locations. Softmax can be seen as the combination of an activation function (the exponential) and a normalization operator.
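To make the operators concrete, the following is a minimal NumPy sketch of the strided, padded convolution of (2) and the channel-wise softmax of (9). It is a naive reference implementation for illustration only, not the implementation used in the paper.

```python
import numpy as np

def convolve(x, f, b, stride=(1, 1), pad=(0, 0, 0, 0)):
    """Naive direct convolution of Eq. (2).
    x: (H, W, D) input map; f: (Hf, Wf, D, K) filter bank; b: (K,) biases;
    stride = (Sh, Sw); pad = (Ph-, Ph+, Pw-, Pw+)."""
    Hf, Wf, D, K = f.shape
    Sh, Sw = stride
    xp = np.pad(x, ((pad[0], pad[1]), (pad[2], pad[3]), (0, 0)))
    Ho = (xp.shape[0] - Hf) // Sh + 1
    Wo = (xp.shape[1] - Wf) // Sw + 1
    y = np.empty((Ho, Wo, K))
    for i in range(Ho):
        for j in range(Wo):
            patch = xp[i * Sh:i * Sh + Hf, j * Sw:j * Sw + Wf, :]
            # triple sum over i', j', d' of Eq. (2), one value per output channel
            y[i, j, :] = b + np.tensordot(patch, f, axes=([0, 1, 2], [0, 1, 2]))
    return y

def softmax_channels(x):
    """Softmax of Eq. (9) across the feature-channel axis at every spatial location."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)
```

With a 4 × 4 single-channel input of ones, a single 2 × 2 filter of ones, zero bias and stride 2, every output element equals the sum of a 2 × 2 patch, i.e. 4.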
because these are heterogeneous. Tian et al. [41] developed a correlation component manifold space learning (CCMSL) approach to first learn a common feature space by capturing the correlations between the heterogeneous databases, and then, in the resulting space, establish a single age estimator across such heterogeneous datasets through correlation representation learning (CRL). As a result, not only is the probability-distribution incompleteness of the individual datasets compensated, but the discriminating ability of the estimator is also reinforced.

4 Proposed method

The framework of the proposed system is given in Fig. 2. The figure also highlights the contributions of the paper. The first contribution is made at the classifier level, where a fusion of asymmetrically trained CNNs is proposed to build a classifier that tackles the dataset imbalance problem. The second contribution is the use of the foreground driven concept co-occurrence matrix (FDCCM) after the classification stage, to refine concept prediction scores and further improve detection performance.

4.1 Fusion of asymmetrically trained deep CNNs

The asymmetric training approach is very useful for datasets in which the numbers of training samples of the classes are considerably different. Such a dataset is known as an imbalanced dataset. The rationale behind the approach is the observation that classes with sufficiently large sample sizes (strong classes) tend to learn quickly and give a better detection rate in the early stage of classifier training than classes with smaller sample sizes (weak classes). Therefore, in order to achieve a better detection rate for the weak classes, a classifier needs to be trained for many more epochs. But if the classifier is trained further, we may lose detection performance on the strong concepts because of overfitting. To tackle this difficulty, we divide the dataset classes into two groups by applying the Global Thresholding method given in Algorithm 1. The first group contains the concepts from cluster-1, with smaller population size, and the second group contains the concepts from cluster-2, with larger sample population size. The proposed method uses two different CNNs for the two groups of classes. These are trained independently on the same dataset so that the detection rate of the concepts each is trained for will be the highest. After this asymmetric training, the concept prediction scores of both CNNs are fused to get the final scores, as shown in Fig. 3.

This approach can be extended further to more groups, depending on the variation in sample size. But as we keep increasing the number of groups, in the worst case we would have one class (concept) per group, resulting in a dedicated detector per concept. Another drawback is the increased time complexity, as more CNNs are required to build the classifier.

In the proposed method, the classifier is built using a fusion of asymmetrically trained deep CNNs. The CNNs are built using the network shown in Fig. 4.
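As a rough illustration of this grouping-and-fusion idea, the sketch below splits concepts into weak and strong groups by a sample-count threshold and fuses two score dictionaries by taking each concept's score from the CNN trained in its favour. Both rules are simplifications made for illustration; the paper's actual procedures are Algorithm 1 and the fusion of Fig. 3.

```python
def split_by_threshold(sample_counts, threshold):
    """Split concept names into weak/strong groups by training-sample count
    (a simplified stand-in for the paper's Global Thresholding, Algorithm 1)."""
    weak = sorted(c for c, n in sample_counts.items() if n < threshold)
    strong = sorted(c for c, n in sample_counts.items() if n >= threshold)
    return weak, strong

def fuse_scores(scores_cnn1, scores_cnn2, weak):
    """Fuse two per-concept score dicts: each concept keeps the score of the CNN
    trained asymmetrically in its favour (one plausible fusion rule, assumed here)."""
    fused = {}
    for c in scores_cnn1:
        fused[c] = scores_cnn1[c] if c in weak else scores_cnn2[c]
    # final ranked concept list, highest prediction score first
    return sorted(fused, key=fused.get, reverse=True)
```

The threshold itself would come from the population statistics of the training set; any rule that separates the two clusters of sample sizes plays the same role.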
[Figure 2 contents: test and training key-frame datasets feed two asymmetrically trained CNNs, CNN-1 and CNN-2 (Contribution #1), whose ranked concept prediction scores are refined using the FDCCM (Contribution #2) to give the final detected concepts; the diagram labels and sample score tables are omitted here.]
Fig. 2 Flowchart of the proposed semantic video concept detection method using convolutional neural network (CNN) and foreground driven
concept co-occurrence matrix (FDCCM)
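The flow of Fig. 2 can be summarized, at a very high level, by the following driver sketch. The classifier callables and the FDCCM dictionary are hypothetical stand-ins for the real components described in Sections 4.1-4.4; only the order of the stages follows the figure, and the max-fusion and top-k refinement rules are illustrative assumptions.

```python
def detect_concepts(key_frame, cnn_weak, cnn_strong, fdccm, top_k=5):
    """High-level pipeline of Fig. 2: two asymmetrically trained CNNs score a
    key-frame, their scores are fused, and the FDCCM refines the ranked list.
    cnn_weak / cnn_strong: callables mapping a key-frame to {concept: score};
    fdccm: {(foreground_concept, concept): co-occurrence value}."""
    scores1 = cnn_weak(key_frame)    # CNN-1: trained in favour of weak concepts
    scores2 = cnn_strong(key_frame)  # CNN-2: trained in favour of strong concepts
    fused = {c: max(scores1[c], scores2[c]) for c in scores1}  # illustrative fusion
    ranked = sorted(fused, key=fused.get, reverse=True)
    # refinement: a detected foreground concept strengthens co-occurring concepts
    for fg in ranked[:top_k]:
        for c in fused:
            w = fdccm.get((fg, c), 0.0)
            if w > 0.0 and c != fg:
                s = fused[c]
                fused[c] = s + w - s * w  # algebraic-sum union, cf. Eq. (11)
    return sorted(fused, key=fused.get, reverse=True)
```

Because the FDCCM contains only foreground rows, the refinement step automatically fires only when a foreground concept appears among the top-ranked scores.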
4.1.1 Architecture used for convolutional neural network

The architecture of the CNN comprises 7 layers. A key-frame of size 352 × 288 (with three color channels), first resampled to 88 × 72 and then converted to gray-scale (single channel), is presented as input. This is convolved with 40 different first-layer square filters, each of size 6 × 6, using a stride of 2 in both x and y. The resulting feature maps are then: (i) pooled (max within 2 × 2 regions, using stride 2), (ii) contrast normalized across feature maps and (iii) passed through a rectified linear function, to give 40 different 17 × 21 element feature maps. Similar operations are repeated in layer 2. The last two layers are fully connected, taking the features from the top convolutional layer as input in vector form. A C-way softmax is the last layer, where C is the number of classes, which is 36 in our case.

4.2 Concept co-occurrence matrix (CCM)

This section introduces the CCM and presents the detailed algorithm to construct it. The following subsection explains what the FDCCM is and how it is derived from the CCM.

Visual co-occurrence provides semantic hints for detecting concepts. It has been shown that the appearance of each concept is consolidated by their co-occurrence in a video shot. Existing approaches consider the co-occurrence
Fig. 3 A fusion of asymmetrically trained CNN classifiers
of pairs of concepts. If "Road" and "Outdoor" form a co-occurrence pair, then the probability of the presence of "Outdoor" can be strengthened by a strong confidence in "Road". Since we are dealing with multi-label data in our approach, we have extended the pairwise co-occurrence concept to multi-concept co-occurrence. For example, for a concept "Car", if the confidences of the co-occurrence pairs "Car-Road", "Car-Outdoor", "Car-Vegetation" and "Car-Sky" are significant in the training data, then a strong confidence in the concept "Car" in a test sample can consolidate the presence of the multiple concepts "Road", "Outdoor", "Vegetation" and "Sky" in proportions depending on the co-occurrence values. We construct a CCM to keep this co-occurrence data.

The CCM is a matrix of size n × n, where n is the number of concepts in the vocabulary set. It keeps the information of the co-occurrence pairs and their co-occurrence values. The structure of the CCM is shown in Table 1, and it is generated using Algorithm 2. The highlighted row in the CCM indicates that the presence of the concept "Airplane" also predicts the probability of the presence of the concepts "Bus", "Face", "Outdoor" and "Sports" by 0.5, 0.6, 0.9 and 0.5 respectively. In other words, the presence of the concept "Airplane" consolidates the presence of the concepts "Bus" by a factor of 0.5, "Face" by 0.6, "Outdoor" by 0.9 and "Sports" by 0.5; the probabilities of these concepts will be strengthened by the respective co-occurrence values.

In our work, the CCM is implemented using two data collections. In this section, we discuss the construction of the CCM using the training video dataset. The video shot segmentation and key-frame extraction steps transform the training video dataset into a key-frame dataset. The key-frame dataset is pre-annotated: we have used the TRECVID [47] development key-frame dataset supplied by NIST [46] as ground-truth for the training dataset. We built a concept co-occurrence matrix from the annotated key-frames of the training video dataset. The key-frames in the ground-truth dataset are multi-concept key-frames, i.e. a single key-frame has multiple concepts. Let C = {c1, c2, ..., cn} be the concept vocabulary of the training key-frame dataset, where n is the total number of unique concepts annotated to the key-frames that the system is attempting to detect. Hence the size of the concept co-occurrence matrix is n × n. Let T = {kf1, kf2, ..., kfm} denote the training key-frame
[Fig. 4: the CNN architecture — convolutional layers with 40, 60, 100, 300 and 500 filters (first-layer filter size 6 × 6, stride 2), 2 × 2 max and average pooling, contrast normalization, ReLU activations, and a final 36-class softmax; the diagram labels are omitted here.]
dataset, of size m. The CCM is constructed by counting the total number of co-occurrences of each concept ci with each individual concept cj over the m key-frames in the training dataset. There are n rows in the matrix; a row r1 represents the concept c1 and consists of n values, each giving the co-occurrence count with one of the concepts c1 to cn. Table 2 shows the concepts and their ranges for the training dataset used to construct the CCM. Figure 5 shows the CCM constructed from the TRECVID dataset using Algorithm 2.

The CCM used is a combination of two CCMs computed at two levels, namely the local visual co-occurrence level and the global visual co-occurrence level. The CCM is evaluated at the local visual co-occurrence level from the local pre-labelled training dataset and at the global visual co-occurrence level from a collection of random images retrieved from Google Images, and the two are maintained in separate CCMs. The final CCM is obtained by combining the two matrices by taking their average, as given by (10):

CCM_Z[I, J] = Avg(CCM_X[I, J], CCM_Y[I, J])    (10)

where CCM_X and CCM_Y are constructed at the local level and the global level respectively, CCM_Z is the final resultant co-occurrence matrix, and I and J index the rows and columns of the matrices.

4.2.1 Foreground driven concept co-occurrence matrix (FDCCM)

The CCM we prepared gives the co-occurrence values of every pair of concepts in the concept vocabulary and is used to improve concept detection. If we observe a list of concepts, we find that it consists mainly of two types: 1) foreground or active concepts, i.e. concepts or objects in a key-frame of foreground nature (e.g. Airplane, Car or Bus), and 2) background or passive concepts, i.e. concepts that are always part of the background scenery (e.g. Sky and Vegetation). Considering this division of concepts, we can divide co-occurrence pairs into four types: 1) Foreground-Background pairs, 2) Background-Background pairs, 3) Background-Foreground pairs and 4) Foreground-Foreground pairs. We observe that it is always possible to predict a background concept if a foreground concept is available; e.g. if an image contains the concept Airplane, then we can easily predict the concept Sky in the background, whereas the concept Airplane cannot always be predicted from Sky in the background. Hence, of the two co-occurrence pairs Airplane-Sky and Sky-Airplane, the first pair is meaningful and is of the Foreground driven type.

Inspired by this observation, we propose a novel CCM, named the Foreground driven Concept Co-occurrence Matrix (FDCCM), which consists of rows of foreground type only. The FDCCM is used for refining concept probabilities only when foreground concepts are found in the list of top ranked scores. The FDCCM has the advantage that the negative impact of predicting foreground concepts on the basis of background concepts is reduced.

To generate the FDCCM, we have identified twelve concepts of foreground nature out of the 36 in the concept vocabulary. Table 3 shows the list of identified foreground concepts.
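A compact sketch of this construction, under the assumption that each key-frame is given as a set of concept labels: count pairwise co-occurrences (the paper's Algorithm 2, whose exact normalization is not reproduced here, so a simple per-row normalization is assumed), average the local and global matrices as in (10), and keep only the rows of foreground concepts to obtain the FDCCM.

```python
import numpy as np

def build_ccm(keyframe_labels, concepts):
    """Co-occurrence counts over multi-concept key-frames, normalized per row
    to [0, 1] (an assumed normalization; the paper's Algorithm 2 may differ)."""
    idx = {c: i for i, c in enumerate(concepts)}
    n = len(concepts)
    counts = np.zeros((n, n))
    for labels in keyframe_labels:
        for a in labels:
            for b in labels:
                counts[idx[a], idx[b]] += 1
    row_max = counts.max(axis=1, keepdims=True)  # diagonal = concept frequency
    return np.divide(counts, row_max, out=np.zeros_like(counts), where=row_max > 0)

def combine_ccms(ccm_local, ccm_global):
    """Final CCM of Eq. (10): element-wise average of local and global CCMs."""
    return (ccm_local + ccm_global) / 2.0

def to_fdccm(ccm, concepts, foreground):
    """Zero every row of a non-foreground concept, keeping foreground rows only."""
    fdccm = np.zeros_like(ccm)
    for i, c in enumerate(concepts):
        if c in foreground:
            fdccm[i] = ccm[i]
    return fdccm
```

For the vocabulary {Airplane, Sky} and the key-frames {Airplane, Sky} and {Sky}, the Airplane row keeps the value for Sky while the Sky row of the FDCCM is zeroed, reflecting that Sky-Airplane is not a foreground driven pair.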
Table 2 The training dataset with the range of start to end key-frame numbers for each concept
Fig. 7 The procedure of refining concept prediction scores using FDCCM (keeping n = 6)
procedure, a top-score concept is picked up from the ranked list of concepts, and using FDCCM the concept prediction scores are updated according to (11). This process is repeated for all the top k concepts of the list, and the updated (refined) values are re-ranked. This re-ranked list of concept scores gives the final detected concepts.

4.4 Computing refined concept prediction scores

The problem of refining the concept prediction scores obtained at the output of the CNN classifier can be cast as finding a union of fuzzy sets using the algebraic-sum S-norm. The CCM is considered as the first fuzzy set, a possibility distribution for concept co-occurrence, and the CNN output, the set of concept prediction scores to be refined, is considered as the second fuzzy set. The refined scores are obtained by computing the fuzzy set union with the algebraic-sum S-norm given by (11),

S(a, b) = a + b − a · b    (11)

where a is a concept's current prediction score and b is the corresponding co-occurrence value.

In our experimentation, the dataset used for training and testing the CNN classifier is TRECVID. Once the CNN is trained, its performance can be tested using the testing dataset. The performance measures used are Average Precision (AP) and Mean Average Precision (MAP). The details of the dataset and performance measures are as follows:

5.1.1 TRECVID datasets and ground-truth data

The National Institute of Standards and Technology (NIST) [46] has been responsible for the annual Text Retrieval Conference (TREC) Video Retrieval Evaluation (TRECVID) [47] since 2001. Every year, it provides a test collection of video datasets along with a task list. It focuses its efforts on promoting progress in video analysis and retrieval. It also provides ground-truth for genuine researchers. Many researchers and research teams participate and contribute in the yearly organized conferences and workshops.
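The score refinement of Section 4.4, combined with the iterative top-k procedure of Fig. 7, can be sketched as follows. The update s ← s + m − s·m is the algebraic-sum S-norm of (11); the variable names and the toy FDCCM values are our own illustrative assumptions:

```python
def refine_scores(scores, fdccm, k=6):
    """Refine CNN prediction scores using FDCCM co-occurrence values.
    scores: {concept: CNN prediction score in [0, 1]}
    fdccm:  {foreground concept: {concept: co-occurrence value}}"""
    refined = dict(scores)
    # pick each of the top-k concepts from the ranked list (Fig. 7 procedure)
    for top in sorted(scores, key=scores.get, reverse=True)[:k]:
        if top not in fdccm:          # only foreground rows trigger refinement
            continue
        for concept, m in fdccm[top].items():
            s = refined.get(concept, 0.0)
            refined[concept] = s + m - s * m   # algebraic-sum S-norm, Eq. (11)
    # re-rank the refined scores to obtain the final detection list
    return sorted(refined.items(), key=lambda kv: kv[1], reverse=True)

cnn_scores = {"Car": 0.8, "Sky": 0.3, "Road": 0.2}
toy_fdccm = {"Car": {"Road": 0.9, "Sky": 0.5}}
final = refine_scores(cnn_scores, toy_fdccm, k=2)
print(final)  # Road's score rises from 0.2 toward 1 via co-occurrence with Car
```

Because the algebraic sum never decreases a score, refinement can only promote concepts that co-occur with a confidently detected foreground concept.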
and the test dataset consists of 9,352 randomly chosen positive key-frames from the next 20 videos and is used to test classifier performance. There are 36 defined concepts in the dataset; the concept list is given in Table 5. NIST provides ground-truth data for the above dataset, in which the 36 concepts are manually annotated over these key-frames. The ground-truth consists of video shots and their representative key-frame(s). Note that a shot in a video clip may have one or more positive and/or negative key-frames. For a concept, a positive key-frame is defined as a frame containing the said concept as visual content.

The ground-truth dataset consists of both positive and negative examples. For the experimentation, the ground-truth data for the development dataset is used. Figure 8 illustrates the distribution of positive examples for each of the 36 individual concepts in the training dataset, and Table 6 presents concept-defining key-frames from the TRECVID training dataset.

2. Pick up the top d_i scores from the ranked list; let P_i be the concept list from the top d concepts. The intersection M_i between Y_i and P_i gives the detected concepts.

Let N_i be the label density for x_i, and let M_i be the intersection between Y_i and P_i; then M_i are the concepts correctly predicted by the classifier.

The average precision (AP) for a test sample is computed by (12),

AP_i = |Y_i ∩ P_i| / |P_i| = M_i / N_i    (12)

and the Top-d MAP for a classifier H on dataset D is obtained by computing the mean of the APs by (13),

MAP(H, D) = (1/|D|) Σ_{i=1}^{|D|} AP_i    (13)
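Under this reading of (12) and (13), the Top-d evaluation can be sketched as below. This is a hedged illustration; the concept names and the two toy samples are invented:

```python
# Sketch of the Top-d evaluation: AP_i = |Y_i ∩ P_i| / |P_i| = M_i / N_i,
# where P_i is the top-d predicted concepts and Y_i the ground-truth labels;
# MAP is the mean of the per-sample APs over the test set.

def average_precision(y_true, ranked_pred, d):
    """y_true: set of ground-truth concepts for a key-frame.
    ranked_pred: predicted concepts sorted by descending score."""
    p = set(ranked_pred[:d])     # P_i: top-d predictions
    m = len(y_true & p)          # M_i: correctly predicted concepts
    return m / len(p)            # AP_i = M_i / |P_i|

def mean_average_precision(samples, d):
    """samples: list of (y_true, ranked_pred) pairs; returns Top-d MAP."""
    return sum(average_precision(y, p, d) for y, p in samples) / len(samples)

samples = [({"Car", "Road"}, ["Car", "Sky", "Road"]),
           ({"Sky"}, ["Sky", "Person", "Car"])]
print(mean_average_precision(samples, d=2))  # (1/2 + 1/2) / 2 = 0.5
```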
Table 7 Number of positive samples per concept in the training dataset. The Group-1 and Group-2 concepts are marked in green and orange, respectively
Table 8 MAP with Top-d performance measures for three sample CNN2 classifier models trained with different numbers of epochs
[Table 8 columns: Top-4, Top-5, Top-7 and Top-n MAP for three classifiers, each with CNN1 trained for 4 epochs and CNN2 trained for 36, 76 or 187 epochs; the last configuration (CNN2 = 187 epochs) is the classifier used for the proposed method.]
Values in bold indicate the results of the proposed method for the classifier used
Fig. 9 Selection of the best value of 'n' for the proposed method (Top-4 MAP plotted for n = 1 to 20)

Fig. 10 Video concept detection performance of the approaches used on the TRECVID dataset (Scnn4, Scnn187, Sccm, Fcnn, Fccm, proposed Ffdccm), measured by Top-d MAP with CNN1 = 4 epochs and CNN2 = 187 epochs
Table 9 Comparison of results in terms of the predicted concepts on five test key-frames by the proposed and other methods
[Table 9 excerpt: the number of ground-truth labels N for the five key-frames is 8, 6, 4, 6 and 7; the Scnn4 row shows subsets of Face, Outdoor, Person, Sky and Urban detected on each key-frame.]
Fig. 11 Average precision of the methods (Scnn4, Scnn187, Sccm, Fcnn, Fccm, proposed Ffdccm) on the five test key-frames KF-1524, KF-2846, KF-4600, KF-7434 and KF-7787
5.4.1 CNN-based concept detectors

We performed a series of experiments to build CNN-based concept detectors. In the first, a method Scnn4 is implemented using a classifier C1 of type deep CNN. C1 is trained for 4 epochs with the key-frame training dataset, and its detection performance is measured over the testing dataset. The output of the classifier C1, as the last layer of the network shows, is a set of concept prediction scores giving the probability of presence of each individual concept. The concept prediction scores are ranked in descending order, and performance is evaluated using the Top-d MAP measure for different values of d.

In the second experiment, a classifier C2 is created with the same network structure used for C1. It is trained for a sufficiently large number of epochs so that the concepts with smaller data sizes can also be recognized. C2 is trained for 187 epochs with the same training dataset. Its detection performance is measured separately, as with C1. We name this method Scnn187.

In the third experiment, in an attempt to further improve the detection rate, we implemented a method Fcnn187 where C1 and C2 are fused together. C1 is used to detect concepts from Group-1 and C2 is used to detect concepts from Group-2, as shown in Table 7. In the classifier fusion process, the prediction scores of Group-1 concepts from C1 and the scores of Group-2 concepts from C2 are combined to yield the final prediction scores. These final merged scores are used to predict the concepts. The performance of the fused classifier is evaluated over the test dataset using Top-d MAP values.

5.4.2 Refining concept scores using FDCCM

In this experiment, to exploit the context relationship among concepts, CCM and later FDCCM are derived from the training dataset and a random image dataset from Google Images using Algorithm 2. FDCCM is used to refine concept probabilities. In the first part of the experiment, a method Sccm is implemented to combine CNN and FDCCM, and its performance is tested. In the second part, a proposed method Ffdccm is implemented where we investigated the combined application of classifier fusion and FDCCM. The performance of these methods is tested and measured on all three sample CNN classifiers trained with 36, 76 and 187 epochs, and is given collectively in Table 8. The results show that the MAP for all the Top-d measures, for all of the methods, with the CNN classifier trained with 187 epochs is better than with the other two classifiers. Hence we have chosen the second CNN2 classifier with 187 epochs, whereas the first CNN1 is trained for 4 epochs.
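The group-wise fusion step of Section 5.4.1 might be sketched as follows. The concept-to-group assignment and all scores below are invented placeholders standing in for Table 7 and the real CNN outputs:

```python
# Sketch: fusing two asymmetrically trained CNNs. Scores for Group-1
# concepts come from C1 (4 epochs); scores for Group-2 concepts come
# from C2 (187 epochs, favouring concepts with few training samples).

GROUP1 = {"Person", "Face", "Outdoor"}   # assumed frequent concepts -> C1
GROUP2 = {"Airplane", "Bus", "Snow"}     # assumed rare concepts     -> C2

def fuse_scores(c1_scores, c2_scores):
    """Merge per-concept prediction scores from the two CNNs
    according to group membership."""
    fused = {}
    for concept, score in c1_scores.items():
        if concept in GROUP1:
            fused[concept] = score       # Group-1 score taken from C1
    for concept, score in c2_scores.items():
        if concept in GROUP2:
            fused[concept] = score       # Group-2 score taken from C2
    return fused

c1 = {"Person": 0.9, "Airplane": 0.4, "Face": 0.7,
      "Bus": 0.2, "Outdoor": 0.6, "Snow": 0.1}
c2 = {"Person": 0.5, "Airplane": 0.8, "Face": 0.4,
      "Bus": 0.6, "Outdoor": 0.3, "Snow": 0.7}
print(fuse_scores(c1, c2))
```

Each concept thus keeps exactly one score, taken from whichever classifier is trained to handle its group, before ranking and FDCCM refinement.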
Table 10 Performance comparison of the proposed method with other state-of-the-art methods, showing the superiority of our approach
In our further experimentation, we thoroughly investigated the optimal number of top values to be considered from the ranked list for the refinement of concepts, i.e. the length at which classifier performance is optimum. Interestingly, the optimal length of the ranked list of scores is found to be 6. The detailed result is given in Fig. 9. In all our experimentation, the MAP values are computed keeping this length at the top 6 scores.

From Fig. 10, it is observed that the proposed method outperforms all the other methods.

Table 9 presents the comparison of results for the proposed method and the other experimented methods with respect to the concepts detected and the average precision (AP) for five sample test key-frames from the test dataset. Figure 11 shows the superiority of the proposed method.

We have also compared the performance of the proposed method with other existing state-of-the-art methods. Table 10 shows that the proposed method exhibits outstanding performance over the other state-of-the-art methods in the area.

The work in this paper can be extended to heterogeneous training datasets, where a concept classifier is trained using two different training datasets to compensate for the concept estimation incompleteness of the individual datasets and to reinforce the discriminating ability of the estimator [41]. It can also be extended to support a sharable environment such as cloud infrastructure [44].

6 Conclusion

In this paper, a novel contribution in the area of multi-label semantic video concept detection has been presented. It introduced a framework for multi-label video concept detection using a state-of-the-art asymmetrically trained fusion of convolutional neural networks and a foreground driven concept co-occurrence matrix, FDCCM. FDCCM keeps the co-occurrence data of each foreground concept with the other concepts, and has been built from the training dataset at the local level and from a collection of random images retrieved by Google Images at the global level. The approach is evaluated on the video concept detection task over the TRECVID dataset and compared with state-of-the-art methods using mean average precision. The experimental results show that the introduction of a novel classifier using a fusion of asymmetrically trained deep CNNs to deal with the dataset imbalance problem improves the video concept detection rate, as is evident from the improved MAP scores for various Top-d values. This work also proves that the use of the fusion of asymmetrically trained CNNs and FDCCM together improves overall concept detection. The results summarized in Table 10 show that the performance of the proposed approach is far better than the other existing techniques reported in the literature.

References

1. Feng L, Bhanu B (2016) Semantic concept co-occurrence patterns for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 38(2):785–799
2. Kuo CH, Chou YH, Chang PC (2016) Using deep convolutional neural networks for image retrieval. Soc Imag Sci Technol. https://doi.org/10.2352/ISSN.2470-1173.2016.2.VIPC-231
3. Podlesnaya A, Podlesnyy S (2016) Deep learning based semantic video indexing and retrieval. arXiv:1601.07754 [cs.IR]
4. McCormac J, Handa A, Davison A, Leutenegger S (2016) SemanticFusion: dense 3D semantic mapping with convolutional neural networks. arXiv:1609.05130v2 [cs.CV]
5. Kikuchi K, Ueki K, Ogawa T, Kobayashi T (2016) Video semantic indexing using object detection-derived features. In: Proc. 24th European signal processing conference (EUSIPCO). Budapest, pp 1288–1292
6. Awad G, Snoek CGM, Smeaton AF, Quénot G (2016) TRECVid semantic indexing of video: a 6-year retrospective. ITE Trans Med Technol Appl (MTA) 4(1):187–208
7. Janwe NJ, Bhoyar KK (2016) Neural network based multi-label semantic video concept detection using novel mixed-hybrid-fusion approach. In: Proceedings of the 2nd international conference on communication and information processing, ICCIP 2016. ACM, Singapore, pp 129–133
8. Vedaldi A, Lenc K (2015) MatConvNet: convolutional neural networks for MATLAB. In: Proc. of the int. conf. on multimedia. ACM, pp 689–692. https://doi.org/10.1145/2733373.2807412
9. Modiri S, Amir A, Zamir R, Shah M (2014) Video classification using semantic concept co-occurrences. https://doi.org/10.1109/CVPR.2014.324
10. Li X, Zhao F, Guo Y (2014) Multi-label image classification with a probabilistic label enhancement model. In: UAI'14 Proceedings of the thirtieth conference on uncertainty in artificial intelligence, pp 430–439
11. Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In: Proceedings of the international conference on machine learning, ICML. Beijing, pp 647–655
12. Zeiler MD, Fergus R (2013) Visualizing and understanding convolutional networks. arXiv:1311.2901 [cs.CV]
13. Memar S, Suriani AL (2013) An integrated semantic-based approach in concept based video retrieval. Multimed Tools Appl 64:77–95. https://doi.org/10.1007/s11042-011-0848-4
14. Oquab M, Bottou L, Laptev I, Sivic J (2013) Learning and transferring mid-level image representations using convolutional neural networks. Technical Report HAL-00911179, INRIA
15. Ma H, Zhu J, Lyu MRT, King I (2010) Bridging the semantic gap between image contents and tags. IEEE Trans Multimed 12(5):462–473
16. Jia D, Berg A, Fei-Fei L (2011) Hierarchical semantic indexing for large scale image retrieval. In: Proceedings of the 2011 IEEE conference on computer vision and pattern recognition, CVPR 2011. Colorado Springs, pp 785–792
17. Farhadi A, Endres I, Hoiem D, Forsyth D (2009) Describing objects by their attributes. In: 2009 IEEE Computer society conference on computer vision and pattern recognition workshops, CVPR Workshops. Miami, pp 1778–1785
N. J. Janwe and K. K. Bhoyar
18. Bobick A, Davis J (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23(1):257–267
19. Davis JW, Bobick AF (1997) The representation and recognition of action using temporal templates. In: Proc. IEEE international conference on computer vision and pattern recognition, pp 928–934
20. Zelnik ML, Irani M (2006) Statistical analysis of dynamic actions. IEEE Trans Pattern Anal Mach Intell 28(9):1530–1535
21. Dong X, Chang SF (2007) Visual event recognition in news video using kernel methods with multi-level temporal alignment. In: Proc. IEEE international conference on computer vision and pattern recognition. Minneapolis
22. Zhou X, Zhuang X, Yan S, Chang SF, Hasegawa-Johnson M, Huang TS (2008) SIFT-bag kernel for video event analysis. In: Proc. ACM international conference on multimedia. Vancouver, pp 229–238
23. Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. In: NIPS, pp 1–8
24. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(5):2278–2324
25. Dean J, Corrado G, Monga R, Chen K, Devin M, Le Q, Mao M, Ranzato M, Senior A, Tucker P, Yang K, Ng A (2012) Large scale distributed deep networks. In: NIPS, pp 1–9
26. Rumelhart D, Hinton G, Williams R (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
27. Torralba A, Murphy KP, Freeman WT (2004) Contextual models for object detection using boosted random fields. In: Proc. Adv. neural inf. process. syst., pp 1401–1408
28. Rabinovich A, Vedaldi A, Galleguillos C, Wiewiora E, Belongie S (2007) Objects in context. In: Proc. 11th IEEE int. conf. comput. vis., pp 1–8
29. Galleguillos C, Rabinovich A, Belongie S (2008) Object categorization using co-occurrence, location and appearance. In: Proc. IEEE Conf. comput. vis. pattern recog., pp 1–8
30. Hwang S, Grauman K (2010) Reading between the lines: object localization using implicit cues from image tags. In: Proc. IEEE Conf. comput. vis. pattern recog., pp 1145–1158
31. Torralba A (2003) Contextual priming for object detection. Int J Comput Vis 53(2):169–191
32. Divvala S, Hoiem D, Hays J, Efros A, Hebert M (2009) An empirical study of context in object detection. In: Proc. IEEE Conf. comput. vis. pattern recog., pp 1271–1278
33. Feng L, Bhanu B (2012) Semantic-visual concept relatedness and co-occurrences for image retrieval. In: ICIP, pp 2429–2432
34. Wang J, Zhao Y, Wu X, Hua XS (2011) A transductive multi-label learning approach for video concept detection. Pattern Recogn 44:2274–2286
35. Zha ZJ, Liu Y, Mei T, Hua XS (2007) Video concept detection using support vector machines – TRECVID 2007 evaluations. Technical report, Microsoft Research Lab – Asia
36. Mazloom M, Li X, Snoek CGM (2016) TagBook: a semantic video representation without supervision for event detection. IEEE Trans Multimed 18(7):1378–1388
37. Markatopoulou F, Mezaris V, Patras I (2015) Cascade of classifiers based on binary, non-binary and deep convolutional network descriptors for video concept detection. In: Proc. IEEE Int. conf. on image processing. Quebec City, pp 1786–1790
38. Markatopoulou F, Mezaris V, Patras I (2016) Deep multi-task learning with label correlation constraint for video concept detection. In: Proc. of the ACM multimedia conference. Amsterdam, pp 501–505
39. Sun Y, Sudo K, Taniguchi Y (2014) TRECVid 2013 semantic video concept detection by NTT-MD-DUT. In: Proc. of TRECVID 2014
40. Chen X, Chen S, Wu Y (2017) Coverless information hiding method based on the Chinese character encoding. J Int Technol 18(2):91–98. https://doi.org/10.6138/JIT.2017.18.2.20160815
41. Tian Q, Chen S (2017) Cross-heterogeneous-database age estimation through correlation representation learning. J Neurocomput 238:286–295
42. Xue Y, Jiang J, Zhao B, Ma T (2017) A self-adaptive artificial bee colony algorithm based on global best for global optimization. Soft Comput 1–18. https://doi.org/10.1007/s00500-017-2547-1
43. Yuan C, Xia Z, Sun X (2017) Coverless image steganography based on SIFT and BOF. J Int Technol 18(2):209–216
44. Wei W, Fan X, Song H, Fan X, Yang J (2016) Imperfect information dynamic stackelberg game based resource allocation using hidden Markov for cloud computing. IEEE Trans Services Comput (99). https://doi.org/10.1109/TSC.2016.2528246
45. Chen Y, Hao C, Wu W, Wu E (2016) Robust dense reconstruction by range merging based on confidence estimation. Sci Chin Inf Sci 59(9):1–11. https://doi.org/10.1007/s11432-015-0957-4
46. NIST: http://www.nist.gov
47. TRECVID: http://www-nlpir.nist.gov

Nitin J. Janwe has completed his B.E. from S.G.G.S. Institute of Technology, Nanded, Maharashtra, India. He did his M.Tech. (CSE) from RTM Nagpur University, Nagpur, India. Currently he is a PhD student at the Department of Information Technology, Yeshwantrao Chavan College of Engineering, Nagpur, Maharashtra, India. He is a member of CSI & IEEE. His areas of interest are Image & Video Processing and Computer Vision & Machine Learning. He is working as Associate Professor in the Department of Computer Technology, Rajiv Gandhi College of Engineering, Research & Technology, Chandrapur, Maharashtra, India.

Kishor K. Bhoyar is currently working as a Professor in the Department of Information Technology, Yeshwantrao Chavan College of Engineering, Nagpur, India. He has over 27 years of experience. He has been awarded a PhD degree in Computer Science and Engineering by Vishweshwarayya National Institute of Technology, Nagpur. He is a member of ACM, CSI, and IACSIT. His areas of interest are Image & Video Processing and Machine Learning.