Janwe 2017
DOI 10.1007/s10489-017-1033-x
Abstract Describing visual contents in videos by semantic concepts is an effective and realistic approach that can be used in video applications such as annotation, indexing, retrieval and ranking. In these applications, video data needs to be labelled with some known set of labels or concepts. Assigning semantic concepts manually is not feasible due to the large volume of ever-growing video data. Hence, automatic semantic concept detection in videos is an active research area. Recently, deep Convolutional Neural Networks (CNNs) used in computer vision tasks have shown remarkable performance. In this paper, we present a novel approach for automatic semantic video concept detection using a deep CNN and a foreground driven concept co-occurrence matrix (FDCCM), which keeps foreground-to-background concept co-occurrence values and is built by exploiting concept co-occurrence relationships in the pre-labelled TRECVID video dataset and in a collection of random images extracted from Google Images. To deal with the dataset imbalance problem, we have extended this approach by fusing two asymmetrically trained deep CNNs and used the FDCCM to further improve concept detection. The performance of the proposed approach is compared with state-of-the-art approaches for video concept detection over the widely used TRECVID dataset and is found to be superior to existing approaches.

Keywords Semantic video concept detection · Foreground driven concept co-occurrence matrix · Convolutional neural network · Deep learning · Multi-label classification · Asymmetric training

Abbreviations
CCM concept co-occurrence matrix
FDCCM foreground driven concept co-occurrence matrix
CNN convolutional neural network

Nitin J. Janwe (nitinj janwe@yahoo.com)
Kishor K. Bhoyar (kkbhoyar@yahoo.com)
Department of Information Technology, YCCE, Nagpur, India

1 Introduction

Due to the advancement of networking and multimedia technologies, the use of multimedia data has increased exponentially, and video indexing and retrieval is seen as one of the most challenging issues in the area. Users always desire to retrieve videos on the basis of semantic objects like car, road, airplane etc. Therefore, it is more realistic and effective for the end user to express a query in terms of semantic objects or concepts. Such queries are meaningless unless the video data has been labelled with some known set of labels or concepts. It is not feasible to manually label semantic concepts considering the huge volume of video collections. Therefore, detecting semantic concepts automatically for video samples is a promising research area and a key step in many video based applications such as annotation, indexing and retrieval. There is a growing need for automatically detecting concepts from low-level visual properties by learning the correspondence from loosely labeled data. The semantic gap [15] is the most challenging problem in concept based video retrieval, which is a gap
between the low-level representation of video and the higher-level semantic concepts which a user associates with it.

In implementing a system for semantic video concept detection, the problems reported in the literature are: 1) Most current methods based on statistical learning techniques [16, 17] use low-level features and struggle to bridge the semantic gap, while the methods based on deep convolutional neural networks did not exploit fusion and asymmetric training of CNN classifiers to deal with the imbalanced dataset problem and to improve concept detection. 2) Many methods used the context relationship of concepts to improve performance, but did not focus on improving performance by exploiting the nature of the concepts.

There are two main contributions of this paper. Firstly, we have used a fusion of deep Convolutional Neural Networks (CNNs) for building a classifier and applied the idea of asymmetric training to deal with the dataset imbalance problem and to improve concept detection performance. Secondly, we have built a novel foreground driven concept co-occurrence matrix (FDCCM) by exploiting the foreground nature of concepts to predict background concepts. It is built using a local visual co-occurrence measure from the local training dataset and a global visual co-occurrence measure from a collection of random images retrieved from Google Images. The FDCCM is used to refine concept prediction scores to improve concept detection.

The remainder of the paper is organized as follows: Section 2 summarizes the related work. Section 3 describes the various steps of semantic video concept detection. Section 4 presents the details of the proposed method. Section 5 gives experimental results and performance evaluation and, lastly, Section 6 concludes the paper.

2 Related work

A considerable amount of research has been reported in the literature to address video processing applications such as semantic video concept detection and video retrieval. This section provides a brief review of the related work in the area of semantic video concept detection.

2.1 Temporal-events and object actions based models

In order to detect temporal events and object actions, a group of methods have been proposed that exploit temporal information from video sequences. These approaches can be roughly classified into those classifying global video sequences into generic semantic categories (such as "wedding", "rally") and those recognizing object-level actions (e.g. "people walking", "train running").

2.1.1 Object-level action recognition

Robust detection and recognition of object-level actions is very useful for video content description, since it provides unique object-level dynamic information. Bobick and Davis [18, 19] derived the temporal template representation from background-subtracted images; a variety of choreographed actions across different subjects and views can be handled. Zelnik and Irani [20] use marginal histograms of spatio-temporal gradients at several temporal scales to cluster and recognize video events.

2.1.2 Generic concept detection by classifying video sequences

Recognizing generic concepts in general domains, such as broadcast news videos, consumer videos etc., is a more challenging problem. In such cases, there are often fast-moving small objects, large camera motion, articulated motions, significant clutter and object occlusion, and the event of interest may involve high-level semantics. Dong and Chang [21] use kernel-based discriminative classification to detect a large variety of generic events in news videos. Zhou et al. [22] propose a SIFT-Bag based framework where each video clip is encoded as a bag of SIFT feature vectors [43], and the distribution of each video clip is modeled by Gaussian Mixture Models (GMMs).

2.2 Static key-frame based

In the second approach, video streams are segmented into shots and the key-frames which best represent each shot are identified. Statistical learning models such as the support vector machine and the neural network are used for training classifiers: low-level features are extracted from key-frames and classifiers are trained on them. There are three approaches to training: 1) supervised training, where the concepts are fixed or known; 2) unsupervised training, where the concepts are not known; and 3) semi-supervised training, where some concepts are known. Zha et al. [35] present a video concept detection system using the support vector machine as a supervised machine learning tool. Janwe and Bhoyar [7] implemented video concept detection using a supervised neural network model. Memar and Affendey [13] proposed an unsupervised concept-based video retrieval system based on the integration of knowledge-based and corpus-based semantic word similarity measures, in order to retrieve video shots for concepts whose annotations are not available to the system.

2.2.1 Low-level feature-based classifier

Videos comprise information from both visual and audio modalities. Each modality brings some information
Multi-label semantic concept detection in videos using fusion of asymmetrically trained deep...
complementary with the other, and their joint processing can help uncover relationships that are otherwise unavailable. Besides some conventional low-level visual features from the color, texture, shape and edge categories, advanced features such as the scale invariant feature transform (SIFT) and bag of features (BOF) [43] can also be used in video concept detection. All of the methods discussed above are based on low-level features.

2.2.2 Deep feature-based classifier

In recent years, because of tremendous increases in computing power, more powerful statistical learning models like deep convolutional neural networks have emerged. These models have drastically improved the robustness of computer vision systems. CNNs have shown better scaling properties than conventional machine learning methods such as SVM, principal component analysis and linear discriminant analysis. Generic descriptors extracted from deep CNNs have proven to be very powerful [11, 12, 14]. The CNN has replaced traditional image representations built on hand-designed features and yields deep hierarchies of features. Recently, researchers have been moving more and more towards deep CNNs [23, 24] because (1) they support large volumes of training data and (2) high-end computing devices such as large arrays of CPU cores [25] and GPUs are now available. Krizhevsky et al. [23] showed that excellent recognition accuracy can be achieved by training a large CNN [24] on a large dataset with the standard back-propagation [26] algorithm. Chien-Hao et al. [2] implemented high-level image feature learning by a deep convolutional neural network in an image retrieval system. Similarly, Podlesnaya and Podlesnyy [3] built a video indexing and retrieval system based on features extracted by a convolutional neural network. McCormac et al. [4] discussed the use of a CNN to implement intuitive user interaction by combining the CNN with a state-of-the-art dense Simultaneous Localization and Mapping (SLAM) system.

2.2.3 Combining low-level and deep features

Kikuchi et al. [5] presented a new feature extraction method in which a combination of low-level features (e.g., SIFT and HOG) and convolutional neural network (CNN) derived features is used to represent the meaningful objects that contribute to the determination of semantic concepts, to achieve accurate and robust semantic indexing of videos.

In the literature on semantic video concept detection, indexing and retrieval systems, TRECVID [47] contributes a lot to the ongoing research in the field. Awad et al. [6] present a detailed review of the work presented and run at TRECVID for the semantic indexing task from 2010 to 2015 for content-based access to video documents and collections.

2.3 Multi-label based models

The approaches in the literature on video concept detection are categorized into single-label and multi-label methods. In the first type, a sample in the dataset is assigned one label (concept), whereas in multi-label or multi-concept methods, multiple concepts can be associated with a data sample. In multi-label methods, a classifier is trained using either a binary classification approach or multi-class classification. Wang et al. [34] discussed a transductive multi-label learning approach for video concept detection. Li [10] discusses implementing multi-label image classification with a probabilistic label enhancement model by constructing auxiliary labels.

2.4 Visual co-occurrence based models

Visual co-occurrence assists in detecting concepts, unlike other conceptual and perceptual models [27] such as the WordNet distance [28]. It has been shown [29] that concept co-occurrence strengthens the appearance of concepts in video. The co-occurrence models in the literature are mainly image based. The approaches used for concept detection in these models have gained increasing popularity [30–32]. Feng and Bhanu [33] measured the relatedness of semantic and visual concepts in the image and built concept co-occurrence patterns. In [28, 29], pairwise concept co-occurrence has been integrated into the concept categorization framework by using a co-occurrence matrix. Modiri et al. [9] proposed a contextual approach to complex video classification based on the generalized maximum clique problem, which uses the co-occurrence of concepts as the context model. In [1], Feng et al. used concept co-occurrence patterns and concept signatures for image annotation and retrieval applications. These approaches have several advantages over standard concept inference techniques; for example, incorporating semantic context compensates for the ambiguity of concept visual appearance. However, the co-occurrence matrix imposes an inevitable pairwise constraint on the relationship.

Several recent works explore multi-concept learning/detection techniques for automated image annotation that aim to model the co-occurrence information among concepts/annotations. Very few works have given attention to the possible coding of concept tags in concept detection methods. Chen et al. [40] proposed a method for the coding of tags based on binary numbers and transformed Chinese characters.

3 Semantic video concept detection

The goal of semantic video concept detection is to detect the semantic concepts of a video segment based on its visual
appearance. The pipeline of a complete semantic video concept detection system is shown in Fig. 1 and consists of four steps, namely: (1) shot boundary detection or shot segmentation, (2) key-frame extraction, (3) classifier training, and (4) concept detection.

The input to the system is a video stream. The details of the above four steps are described below.

3.1 Shot segmentation

In order to detect semantic concepts precisely in a video, the video needs to be segmented into shots. A shot exhibits strong content correlations between frames; hence shots are considered to be the basic units in concept detection. Generally, shot boundaries are of two types: a cut, where the transition between two consecutive shots is abrupt, and a gradual transition, where the boundary is stretched over multiple frames; examples are dissolve, fade-in and fade-out. Shot boundary detection methods usually extract visual features from each frame, measure the similarities between frames, and detect shot boundaries between frames that are dissimilar.

3.2 Key-frame extraction

In video processing applications, shots are often represented by a single frame, called a key-frame, which is supposed to be representative of the shot. There are great similarities among the frames of the same shot; therefore, the frames that best reflect the shot contents are chosen as key-frames. Mostly, the middle frame of a shot is taken as the key-frame, on the assumption that the middle segment contains the key contents, but many other techniques exist. A shot need not always be represented by a single frame; in cases of greater visual complexity, multiple key-frames are required. For such cases, approaches like unsupervised clustering can be used, where the frames in a shot are clustered depending on the variation in shot content and the frame closest to each cluster center is chosen as a key-frame. Each cluster is represented by a unique key-frame, so a single shot can have multiple key-frames. The choice of key-frame may also depend on the object or event one is interested in; the frame that best represents that object or event can then be chosen as the key-frame.

3.3 Training a CNN classifier

Automatic multi-label semantic concept detection for videos is a supervised machine learning problem, where we provide a set of extracted key-frames of video shots as input and the corresponding concepts as output. The multi-label video concept detection task is posed as a binary classification problem. A CNN is used as a baseline.

3.3.1 Overview of CNN

A neural network is a function g mapping data x, for example an image, to an output vector y, for example an image label. The function is the composition of a sequence of simpler functions f_l, which are called computational blocks or layers. Let x_1, x_2, ..., x_L be the outputs of the layers in the network, and let x_0 = x denote the network input. Each intermediate output x_l = f_l(x_{l-1}; w_l) is computed from the previous output x_{l-1} by applying the function f_l with parameters w_l.

In a CNN, the data has a spatial structure: each x_l ∈ R^{H_l × W_l × C_l} is a 3D array or tensor, where the first two dimensions H_l (height) and W_l (width) are interpreted as spatial dimensions and the third dimension C_l as the number of feature channels. Hence, the tensor x_l represents an H_l × W_l field of C_l-dimensional feature vectors, one for each spatial location. A fourth dimension N_l of the tensor spans multiple data samples packed in a single batch for efficient parallel processing; the number of data samples N_l in a batch is called the batch cardinality. The network is called convolutional because the functions f_l are local and translation-invariant operators (i.e. non-linear filters), like linear convolution. CNNs are often used as classifiers or regressors. The output ŷ = f(x) is a vector of probabilities, one for each of N possible image labels (dog, cat, trilobite, etc.). If y is the true label of image x, we can measure the CNN performance by a loss function l_y(ŷ) ∈ R which assigns a penalty to classification errors. The CNN parameters can then be tuned, or learned, to minimize this loss averaged over a large dataset of labelled example images. Learning a CNN requires computing the derivative of the loss with respect to the network parameters; the derivatives are computed using an algorithm called back-propagation. CNNs are inherently translation invariant: their basic components (convolution, pooling, and activation functions) operate on local input regions and depend only on relative spatial coordinates. The details of the most important CNN operations are as follows:

(1) Convolution: The convolutional block is implemented by the function y = convolution(x, f, b), which computes the convolution of the input map x with a bank of K
multi-dimensional filters f and biases b. Here, x ∈ R^{H×W×D}, f ∈ R^{H'×W'×D×K}, y ∈ R^{H''×W''×K}. The output of convolving the signal, for a 1D slice, is given by (1):

y_{i''j''k} = b_k + Σ_{i'=1}^{H'} Σ_{j'=1}^{W'} Σ_{d'=1}^{D} f_{i'j'd'k} · x_{i''+i'-1, j''+j'-1, d'}    (1)

(2) Padding and Stride: The convolution function allows specifying top-bottom-left-right paddings (P_h^-, P_h^+, P_w^-, P_w^+) of the input array and subsampling strides (S_h, S_w) of the output array:

y_{i''j''k} = b_k + Σ_{i'=1}^{H'} Σ_{j'=1}^{W'} Σ_{d'=1}^{D} f_{i'j'd'k} · x_{S_h(i''-1)+i'-P_h^-, S_w(j''-1)+j'-P_w^-, d'}    (2)

(3) Spatial Pooling: It is of two types, max and sum pooling.

• Max Pooling: Computes the maximum response of each feature channel in an H' × W' patch by (3). Note that the input x and the output y have the same number of feature channels.

• Spatial Normalization: The spatial normalization operator acts on different feature channels independently and rescales each input feature by the energy of the features in a local neighborhood. First, the energy of the features in a W' × H' neighborhood is evaluated by (8):

n²_{i''j''d} = (1 / (W'H')) Σ_{1≤i'≤H', 1≤j'≤W'} x²_{i''+i'-1-⌊(H'-1)/2⌋, j''+j'-1-⌊(W'-1)/2⌋, d}    (8)

(6) Softmax: Computes the softmax operator:

y_{ijk} = e^{x_{ijk}} / Σ_{t=1}^{D} e^{x_{ijt}}    (9)

Note that the operator is applied across feature channels, in a convolutional manner, at all spatial locations. Softmax can be seen as the combination of an activation function (the exponential) and a normalization operator.
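To make the operators concrete, the following is a minimal NumPy sketch of the strided, padded convolution of (2) and the channel-wise softmax of (9). It is a naive reference implementation for illustration only, not the implementation used in the paper.

```python
import numpy as np

def convolve(x, f, b, stride=(1, 1), pad=(0, 0, 0, 0)):
    """Naive direct convolution of Eq. (2).
    x: (H, W, D) input map; f: (Hf, Wf, D, K) filter bank; b: (K,) biases;
    stride = (Sh, Sw); pad = (Ph-, Ph+, Pw-, Pw+)."""
    Hf, Wf, D, K = f.shape
    Sh, Sw = stride
    xp = np.pad(x, ((pad[0], pad[1]), (pad[2], pad[3]), (0, 0)))
    Ho = (xp.shape[0] - Hf) // Sh + 1
    Wo = (xp.shape[1] - Wf) // Sw + 1
    y = np.empty((Ho, Wo, K))
    for i in range(Ho):
        for j in range(Wo):
            patch = xp[i * Sh:i * Sh + Hf, j * Sw:j * Sw + Wf, :]
            # triple sum over i', j', d' of Eq. (2), one value per output channel
            y[i, j, :] = b + np.tensordot(patch, f, axes=([0, 1, 2], [0, 1, 2]))
    return y

def softmax_channels(x):
    """Softmax of Eq. (9) across the feature-channel axis at every spatial location."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)
```

With a 4 × 4 single-channel input of ones, a single 2 × 2 filter of ones, zero bias and stride 2, every output element equals the sum of a 2 × 2 patch, i.e. 4.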
because these are heterogeneous. Tian et al. [41] developed a correlation component manifold space learning (CCMSL) approach to first learn a common feature space by capturing the correlations between the heterogeneous databases, and then, in the resulting space, establish a single age estimator across such heterogeneous datasets through correlation representation learning (CRL). As a result, not only is the probability-distribution incompleteness of the individual datasets compensated, but the discriminating ability of the estimator is also reinforced.

4 Proposed method

The framework of the proposed system is given in Fig. 2. The figure also highlights the contributions of the paper. The first contribution is made at the classifier level, where a fusion of asymmetrically trained CNNs is proposed to build a classifier that tackles the dataset imbalance problem. The second contribution is the use of the foreground driven concept co-occurrence matrix (FDCCM) after the classification stage, to refine concept prediction scores and further improve detection performance.

4.1 Fusion of asymmetrically trained deep CNNs

The asymmetric training approach is very useful for datasets in which the numbers of training samples of the classes are considerably different. Such a dataset is known as an imbalanced dataset. The rationale behind the approach is the observation that classes with sufficiently large sample sizes (strong classes) tend to learn quickly and give a better detection rate in the early stage of classifier training than classes with smaller sample sizes (weak classes). Therefore, in order to achieve a better detection rate for the weak classes, a classifier needs to be trained for many more epochs. But if the classifier is trained further, we may lose detection performance on the strong concepts because of overfitting. To tackle this difficulty, we divide the dataset classes into two groups by applying the Global Thresholding method given in Algorithm 1. The first group contains the concepts from cluster-1, with smaller population size, and the second group contains the concepts from cluster-2, with larger sample population size. The proposed method uses two different CNNs for the two groups of classes. These are trained independently on the same dataset so that the detection rate of the concepts each is trained for will be the highest. After this asymmetric training, the concept prediction scores of both CNNs are fused to get the final scores, as shown in Fig. 3.

This approach can be extended further to more groups, depending on the variation in sample size. But as we keep increasing the number of groups, in the worst case we would have one class (concept) per group, resulting in a dedicated detector per concept. Another drawback is the increased time complexity, as more CNNs are required to build the classifier.

In the proposed method, the classifier is built using a fusion of asymmetrically trained deep CNNs. The CNNs are built using the network shown in Fig. 4.
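As a rough illustration of this grouping-and-fusion idea, the sketch below splits concepts into weak and strong groups by a sample-count threshold and fuses two score dictionaries by taking each concept's score from the CNN trained in its favour. Both rules are simplifications made for illustration; the paper's actual procedures are Algorithm 1 and the fusion of Fig. 3.

```python
def split_by_threshold(sample_counts, threshold):
    """Split concept names into weak/strong groups by training-sample count
    (a simplified stand-in for the paper's Global Thresholding, Algorithm 1)."""
    weak = sorted(c for c, n in sample_counts.items() if n < threshold)
    strong = sorted(c for c, n in sample_counts.items() if n >= threshold)
    return weak, strong

def fuse_scores(scores_cnn1, scores_cnn2, weak):
    """Fuse two per-concept score dicts: each concept keeps the score of the CNN
    trained asymmetrically in its favour (one plausible fusion rule, assumed here)."""
    fused = {}
    for c in scores_cnn1:
        fused[c] = scores_cnn1[c] if c in weak else scores_cnn2[c]
    # final ranked concept list, highest prediction score first
    return sorted(fused, key=fused.get, reverse=True)
```

The threshold itself would come from the population statistics of the training set; any rule that separates the two clusters of sample sizes plays the same role.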
[Figure 2 contents: test and training key-frame datasets feed two asymmetrically trained CNNs, CNN-1 and CNN-2 (Contribution #1), whose ranked concept prediction scores are refined using the FDCCM (Contribution #2) to give the final detected concepts; the diagram labels and sample score tables are omitted here.]
Fig. 2 Flowchart of the proposed semantic video concept detection method using convolutional neural network (CNN) and foreground driven
concept co-occurrence matrix (FDCCM)
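The flow of Fig. 2 can be summarized, at a very high level, by the following driver sketch. The classifier callables and the FDCCM dictionary are hypothetical stand-ins for the real components described in Sections 4.1-4.4; only the order of the stages follows the figure, and the max-fusion and top-k refinement rules are illustrative assumptions.

```python
def detect_concepts(key_frame, cnn_weak, cnn_strong, fdccm, top_k=5):
    """High-level pipeline of Fig. 2: two asymmetrically trained CNNs score a
    key-frame, their scores are fused, and the FDCCM refines the ranked list.
    cnn_weak / cnn_strong: callables mapping a key-frame to {concept: score};
    fdccm: {(foreground_concept, concept): co-occurrence value}."""
    scores1 = cnn_weak(key_frame)    # CNN-1: trained in favour of weak concepts
    scores2 = cnn_strong(key_frame)  # CNN-2: trained in favour of strong concepts
    fused = {c: max(scores1[c], scores2[c]) for c in scores1}  # illustrative fusion
    ranked = sorted(fused, key=fused.get, reverse=True)
    # refinement: a detected foreground concept strengthens co-occurring concepts
    for fg in ranked[:top_k]:
        for c in fused:
            w = fdccm.get((fg, c), 0.0)
            if w > 0.0 and c != fg:
                s = fused[c]
                fused[c] = s + w - s * w  # algebraic-sum union, cf. Eq. (11)
    return sorted(fused, key=fused.get, reverse=True)
```

Because the FDCCM contains only foreground rows, the refinement step automatically fires only when a foreground concept appears among the top-ranked scores.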
4.1.1 Architecture used for convolutional neural network

The architecture of the CNN comprises 7 layers. A key-frame of size 352 × 288 (with three color channels), first resampled to 88 × 72 and then converted to gray-scale (single channel), is presented as input. This is convolved with 40 different first-layer square filters, each of size 6 × 6, using a stride of 2 in both x and y. The resulting feature maps are then: (i) pooled (max within 2 × 2 regions, using stride 2), (ii) contrast normalized across feature maps and (iii) passed through a rectified linear function, to give 40 different 17 × 21 element feature maps. Similar operations are repeated in layer 2. The last two layers are fully connected, taking the features from the top convolutional layer as input in vector form. A C-way softmax is the last layer, where C is the number of classes, which is 36 in our case.

4.2 Concept co-occurrence matrix (CCM)

This section introduces the CCM and presents the detailed algorithm to construct it. The following subsection explains what the FDCCM is and how it is derived from the CCM.

Visual co-occurrence provides semantic hints for detecting concepts. It has been shown that the appearance of each concept is consolidated by their co-occurrence in a video shot. Existing approaches consider the co-occurrence
Fig. 3 A fusion of asymmetrically trained CNN classifiers
of pairs of concepts. If "Road" and "Outdoor" form a co-occurrence pair, then the probability of the presence of "Outdoor" can be strengthened by a strong confidence in "Road". Since we are dealing with multi-label data in our approach, we have extended the pairwise co-occurrence concept to multi-concept co-occurrence. For example, for a concept "Car", if the confidences of the co-occurrence pairs "Car-Road", "Car-Outdoor", "Car-Vegetation" and "Car-Sky" are significant in the training data, then a strong confidence in the concept "Car" in a test sample can consolidate the presence of the multiple concepts "Road", "Outdoor", "Vegetation" and "Sky" in proportions depending on the co-occurrence values. We construct a CCM to keep this co-occurrence data.

The CCM is a matrix of size n × n, where n is the number of concepts in the vocabulary set. It keeps the information of the co-occurrence pairs and their co-occurrence values. The structure of the CCM is shown in Table 1, and it is generated using Algorithm 2. The highlighted row in the CCM indicates that the presence of the concept "Airplane" also predicts the probability of the presence of the concepts "Bus", "Face", "Outdoor" and "Sports" by 0.5, 0.6, 0.9 and 0.5 respectively. In other words, the presence of the concept "Airplane" consolidates the presence of the concepts "Bus" by a factor of 0.5, "Face" by 0.6, "Outdoor" by 0.9 and "Sports" by 0.5; the probabilities of these concepts will be strengthened by the respective co-occurrence values.

In our work, the CCM is implemented using two data collections. In this section, we discuss the construction of the CCM using the training video dataset. The video shot segmentation and key-frame extraction steps transform the training video dataset into a key-frame dataset. The key-frame dataset is pre-annotated: we have used the TRECVID [47] development key-frame dataset supplied by NIST [46] as ground-truth for the training dataset. We built a concept co-occurrence matrix from the annotated key-frames of the training video dataset. The key-frames in the ground-truth dataset are multi-concept key-frames, i.e. a single key-frame has multiple concepts. Let C = {c1, c2, ..., cn} be the concept vocabulary of the training key-frame dataset, where n is the total number of unique concepts annotated to the key-frames that the system is attempting to detect. Hence the size of the concept co-occurrence matrix is n × n. Let T = {kf1, kf2, ..., kfm} denote the training key-frame
[Fig. 4: the CNN architecture — convolutional layers with 40, 60, 100, 300 and 500 filters (first-layer filter size 6 × 6, stride 2), 2 × 2 max and average pooling, contrast normalization, ReLU activations, and a final 36-class softmax; the diagram labels are omitted here.]
dataset, of size m. The CCM is constructed by counting the total number of co-occurrences of each concept ci with each individual concept cj over the m key-frames in the training dataset. There are n rows in the matrix; a row r1 represents the concept c1 and consists of n values, each giving the co-occurrence count with one of the concepts c1 to cn. Table 2 shows the concepts and their ranges for the training dataset used to construct the CCM. Figure 5 shows the CCM constructed from the TRECVID dataset using Algorithm 2.

The CCM used is a combination of two CCMs computed at two levels, namely the local visual co-occurrence level and the global visual co-occurrence level. The CCM is evaluated at the local visual co-occurrence level from the local pre-labelled training dataset and at the global visual co-occurrence level from a collection of random images retrieved from Google Images, and the two are maintained in separate CCMs. The final CCM is obtained by combining the two matrices by taking their average, as given by (10):

CCM_Z[I, J] = Avg(CCM_X[I, J], CCM_Y[I, J])    (10)

where CCM_X and CCM_Y are constructed at the local level and the global level respectively, CCM_Z is the final resultant co-occurrence matrix, and I and J index the rows and columns of the matrices.

4.2.1 Foreground driven concept co-occurrence matrix (FDCCM)

The CCM we prepared gives the co-occurrence values of every pair of concepts in the concept vocabulary and is used to improve concept detection. If we observe a list of concepts, we find that it consists mainly of two types: 1) foreground or active concepts, i.e. concepts or objects in a key-frame of foreground nature (e.g. Airplane, Car or Bus), and 2) background or passive concepts, i.e. concepts that are always part of the background scenery (e.g. Sky and Vegetation). Considering this division of concepts, we can divide co-occurrence pairs into four types: 1) Foreground-Background pairs, 2) Background-Background pairs, 3) Background-Foreground pairs and 4) Foreground-Foreground pairs. We observe that it is always possible to predict a background concept if a foreground concept is available; e.g. if an image contains the concept Airplane, then we can easily predict the concept Sky in the background, whereas the concept Airplane cannot always be predicted from Sky in the background. Hence, of the two co-occurrence pairs Airplane-Sky and Sky-Airplane, the first pair is meaningful and is of the Foreground driven type.

Inspired by this observation, we propose a novel CCM, named the Foreground driven Concept Co-occurrence Matrix (FDCCM), which consists of rows of foreground type only. The FDCCM is used for refining concept probabilities only when foreground concepts are found in the list of top ranked scores. The FDCCM has the advantage that the negative impact of predicting foreground concepts on the basis of background concepts is reduced.

To generate the FDCCM, we have identified twelve concepts of foreground nature out of the 36 in the concept vocabulary. Table 3 shows the list of identified foreground concepts.
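A compact sketch of this construction, under the assumption that each key-frame is given as a set of concept labels: count pairwise co-occurrences (the paper's Algorithm 2, whose exact normalization is not reproduced here, so a simple per-row normalization is assumed), average the local and global matrices as in (10), and keep only the rows of foreground concepts to obtain the FDCCM.

```python
import numpy as np

def build_ccm(keyframe_labels, concepts):
    """Co-occurrence counts over multi-concept key-frames, normalized per row
    to [0, 1] (an assumed normalization; the paper's Algorithm 2 may differ)."""
    idx = {c: i for i, c in enumerate(concepts)}
    n = len(concepts)
    counts = np.zeros((n, n))
    for labels in keyframe_labels:
        for a in labels:
            for b in labels:
                counts[idx[a], idx[b]] += 1
    row_max = counts.max(axis=1, keepdims=True)  # diagonal = concept frequency
    return np.divide(counts, row_max, out=np.zeros_like(counts), where=row_max > 0)

def combine_ccms(ccm_local, ccm_global):
    """Final CCM of Eq. (10): element-wise average of local and global CCMs."""
    return (ccm_local + ccm_global) / 2.0

def to_fdccm(ccm, concepts, foreground):
    """Zero every row of a non-foreground concept, keeping foreground rows only."""
    fdccm = np.zeros_like(ccm)
    for i, c in enumerate(concepts):
        if c in foreground:
            fdccm[i] = ccm[i]
    return fdccm
```

For the vocabulary {Airplane, Sky} and the key-frames {Airplane, Sky} and {Sky}, the Airplane row keeps the value for Sky while the Sky row of the FDCCM is zeroed, reflecting that Sky-Airplane is not a foreground driven pair.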
Table 2 The training dataset with the range of start to end key-frame numbers for each concept
Fig. 7 The procedure of refining concept prediction scores using FDCCM (keeping n = 6)
procedure, a top-score concept is picked up from the ranked list of concepts, and using FDCCM the concept prediction scores are updated according to (11). This process is repeated for all the top k concepts of the list, and the updated (refined) values are re-ranked. This re-ranked list of concept scores gives the final detected concepts.

4.4 Computing refined concept prediction scores

The problem of refining the concept prediction scores obtained at the output of the CNN classifier can be cast as finding a union of fuzzy sets using the algebraic-sum S-norm. The CCM is considered as the first fuzzy set, a possibility distribution for concept co-occurrence, and the CNN output, the set of concept prediction scores to be refined, is considered as the second fuzzy set. The refined scores are obtained by computing the fuzzy set union with the algebraic-sum S-norm given by (11),

S(a, b) = a + b − a · b    (11)

where a is a concept's current prediction score and b is the corresponding co-occurrence value.

In our experimentation, the dataset used for training and testing the CNN classifier is TRECVID. Once the CNN is trained, its performance can be tested using the testing dataset. The performance measures used are Average Precision (AP) and Mean Average Precision (MAP). The details of the dataset and performance measures are as follows:

5.1.1 TRECVID datasets and ground-truth data

The National Institute of Standards and Technology (NIST) [46] has been responsible for the annual Text Retrieval Conference (TREC) Video Retrieval Evaluation (TRECVID) [47] since 2001. Every year, it provides a test collection of video datasets along with a task list. It focuses its efforts on promoting progress in video analysis and retrieval. It also provides ground-truth for genuine researchers. Many researchers and research teams participate and contribute in the yearly organized conferences and workshops.
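The score refinement of Section 4.4, combined with the iterative top-k procedure of Fig. 7, can be sketched as follows. The update s ← s + m − s·m is the algebraic-sum S-norm of (11); the variable names and the toy FDCCM values are our own illustrative assumptions:

```python
def refine_scores(scores, fdccm, k=6):
    """Refine CNN prediction scores using FDCCM co-occurrence values.
    scores: {concept: CNN prediction score in [0, 1]}
    fdccm:  {foreground concept: {concept: co-occurrence value}}"""
    refined = dict(scores)
    # pick each of the top-k concepts from the ranked list (Fig. 7 procedure)
    for top in sorted(scores, key=scores.get, reverse=True)[:k]:
        if top not in fdccm:          # only foreground rows trigger refinement
            continue
        for concept, m in fdccm[top].items():
            s = refined.get(concept, 0.0)
            refined[concept] = s + m - s * m   # algebraic-sum S-norm, Eq. (11)
    # re-rank the refined scores to obtain the final detection list
    return sorted(refined.items(), key=lambda kv: kv[1], reverse=True)

cnn_scores = {"Car": 0.8, "Sky": 0.3, "Road": 0.2}
toy_fdccm = {"Car": {"Road": 0.9, "Sky": 0.5}}
final = refine_scores(cnn_scores, toy_fdccm, k=2)
print(final)  # Road's score rises from 0.2 toward 1 via co-occurrence with Car
```

Because the algebraic sum never decreases a score, refinement can only promote concepts that co-occur with a confidently detected foreground concept.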
and the test dataset consists of 9,352 randomly chosen positive key-frames from the next 20 videos and is used to test classifier performance. There are 36 defined concepts in the dataset; the concept list is given in Table 5. NIST provides ground-truth data for the above dataset, in which the 36 concepts are manually annotated over these key-frames. The ground-truth consists of video shots and their representative key-frame(s). Note that a shot in a video clip may have one or more positive and/or negative key-frames. For a concept, a positive key-frame is defined as a frame containing the said concept as visual content.

The ground-truth dataset consists of both positive and negative examples. For the experimentation, the ground-truth data for the development dataset is used. Figure 8 illustrates the distribution of positive examples for each of the 36 individual concepts in the training dataset, and Table 6 presents concept-defining key-frames from the TRECVID training dataset.

2. Pick up the top d_i scores from the ranked list; let P_i be the concept list from the top d concepts. The intersection M_i between Y_i and P_i gives the detected concepts.

Let N_i be the label density for x_i, and let M_i be the intersection between Y_i and P_i; then M_i are the concepts correctly predicted by the classifier.

The average precision (AP) for a test sample is computed by (12),

AP_i = |Y_i ∩ P_i| / |P_i| = M_i / N_i    (12)

and the Top-d MAP for a classifier H on dataset D is obtained by computing the mean of the APs by (13),

MAP(H, D) = (1/|D|) Σ_{i=1}^{|D|} AP_i    (13)
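Under this reading of (12) and (13), the Top-d evaluation can be sketched as below. This is a hedged illustration; the concept names and the two toy samples are invented:

```python
# Sketch of the Top-d evaluation: AP_i = |Y_i ∩ P_i| / |P_i| = M_i / N_i,
# where P_i is the top-d predicted concepts and Y_i the ground-truth labels;
# MAP is the mean of the per-sample APs over the test set.

def average_precision(y_true, ranked_pred, d):
    """y_true: set of ground-truth concepts for a key-frame.
    ranked_pred: predicted concepts sorted by descending score."""
    p = set(ranked_pred[:d])     # P_i: top-d predictions
    m = len(y_true & p)          # M_i: correctly predicted concepts
    return m / len(p)            # AP_i = M_i / |P_i|

def mean_average_precision(samples, d):
    """samples: list of (y_true, ranked_pred) pairs; returns Top-d MAP."""
    return sum(average_precision(y, p, d) for y, p in samples) / len(samples)

samples = [({"Car", "Road"}, ["Car", "Sky", "Road"]),
           ({"Sky"}, ["Sky", "Person", "Car"])]
print(mean_average_precision(samples, d=2))  # (1/2 + 1/2) / 2 = 0.5
```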
Table 7 Number of positive samples per concept in the training dataset. The Group-1 and Group-2 concepts are marked in green and orange, respectively
Table 8 MAP with Top-d performance measures for three sample CNN2 classifier models trained with different numbers of epochs
[Table 8 columns: Top-4, Top-5, Top-7 and Top-n MAP for three classifiers, each with CNN1 trained for 4 epochs and CNN2 trained for 36, 76 or 187 epochs; the last configuration (CNN2 = 187 epochs) is the classifier used for the proposed method.]
Values in bold indicate the results of the proposed method for the classifier used
Fig. 9 Selection of the best value of 'n' for the proposed method (Top-4 MAP plotted for n = 1 to 20)

Fig. 10 Video concept detection performance of the approaches used on the TRECVID dataset (Scnn4, Scnn187, Sccm, Fcnn, Fccm, proposed Ffdccm), measured by Top-d MAP with CNN1 = 4 epochs and CNN2 = 187 epochs
Table 9 Comparison of results in terms of the predicted concepts on five test key-frames by the proposed and other methods
[Table 9 excerpt: the number of ground-truth labels N for the five key-frames is 8, 6, 4, 6 and 7; the Scnn4 row shows subsets of Face, Outdoor, Person, Sky and Urban detected on each key-frame.]
Fig. 11 Average precision of the methods (Scnn4, Scnn187, Sccm, Fcnn, Fccm, proposed Ffdccm) on the five test key-frames KF-1524, KF-2846, KF-4600, KF-7434 and KF-7787
5.4.1 CNN-based concept detectors

We performed a series of experiments to build CNN-based concept detectors. In the first, a method Scnn4 is implemented using a classifier C1 of type deep CNN. C1 is trained for 4 epochs with the key-frame training dataset, and its detection performance is measured over the testing dataset. The output of the classifier C1, as the last layer of the network shows, is a set of concept prediction scores giving the probability of presence of each individual concept. The concept prediction scores are ranked in descending order, and performance is evaluated using the Top-d MAP measure for different values of d.

In the second experiment, a classifier C2 is created with the same network structure used for C1. It is trained for a sufficiently large number of epochs so that the concepts with smaller data sizes can also be recognized. C2 is trained for 187 epochs with the same training dataset. Its detection performance is measured separately, as with C1. We name this method Scnn187.

In the third experiment, in an attempt to further improve the detection rate, we implemented a method Fcnn187 where C1 and C2 are fused together. C1 is used to detect concepts from Group-1 and C2 is used to detect concepts from Group-2, as shown in Table 7. In the classifier fusion process, the prediction scores of Group-1 concepts from C1 and the scores of Group-2 concepts from C2 are combined to yield the final prediction scores. These final merged scores are used to predict the concepts. The performance of the fused classifier is evaluated over the test dataset using Top-d MAP values.

5.4.2 Refining concept scores using FDCCM

In this experiment, to exploit the context relationship among concepts, CCM and later FDCCM are derived from the training dataset and a random image dataset from Google Images using Algorithm 2. FDCCM is used to refine concept probabilities. In the first part of the experiment, a method Sccm is implemented to combine CNN and FDCCM, and its performance is tested. In the second part, a proposed method Ffdccm is implemented where we investigated the combined application of classifier fusion and FDCCM. The performance of these methods is tested and measured on all three sample CNN classifiers trained with 36, 76 and 187 epochs, and is given collectively in Table 8. The results show that the MAP for all the Top-d measures, for all of the methods, with the CNN classifier trained with 187 epochs is better than with the other two classifiers. Hence we have chosen the second CNN2 classifier with 187 epochs, whereas the first CNN1 is trained for 4 epochs.
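The group-wise fusion step of Section 5.4.1 might be sketched as follows. The concept-to-group assignment and all scores below are invented placeholders standing in for Table 7 and the real CNN outputs:

```python
# Sketch: fusing two asymmetrically trained CNNs. Scores for Group-1
# concepts come from C1 (4 epochs); scores for Group-2 concepts come
# from C2 (187 epochs, favouring concepts with few training samples).

GROUP1 = {"Person", "Face", "Outdoor"}   # assumed frequent concepts -> C1
GROUP2 = {"Airplane", "Bus", "Snow"}     # assumed rare concepts     -> C2

def fuse_scores(c1_scores, c2_scores):
    """Merge per-concept prediction scores from the two CNNs
    according to group membership."""
    fused = {}
    for concept, score in c1_scores.items():
        if concept in GROUP1:
            fused[concept] = score       # Group-1 score taken from C1
    for concept, score in c2_scores.items():
        if concept in GROUP2:
            fused[concept] = score       # Group-2 score taken from C2
    return fused

c1 = {"Person": 0.9, "Airplane": 0.4, "Face": 0.7,
      "Bus": 0.2, "Outdoor": 0.6, "Snow": 0.1}
c2 = {"Person": 0.5, "Airplane": 0.8, "Face": 0.4,
      "Bus": 0.6, "Outdoor": 0.3, "Snow": 0.7}
print(fuse_scores(c1, c2))
```

Each concept thus keeps exactly one score, taken from whichever classifier is trained to handle its group, before ranking and FDCCM refinement.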
Table 10 Performance comparison of the proposed method with other state-of-the-art methods, showing the superiority of our approach
In our further experimentation, we thoroughly investigated the optimal number of top values to be considered from the ranked list for the refinement of concepts, i.e. the length at which classifier performance is optimum. Interestingly, the optimal length of the ranked list of scores is found to be 6. The detailed result is given in Fig. 9. In all our experimentation, the MAP values are computed keeping this length at the top 6 scores.

From Fig. 10, it is observed that the proposed method outperforms all the other methods.

Table 9 presents the comparison of results for the proposed method and the other experimented methods with respect to the concepts detected and the average precision (AP) for five sample test key-frames from the test dataset. Figure 11 shows the superiority of the proposed method.

We have also compared the performance of the proposed method with other existing state-of-the-art methods. Table 10 shows that the proposed method exhibits outstanding performance over the other state-of-the-art methods in the area.

The work in this paper can be extended to heterogeneous training datasets, where a concept classifier is trained using two different training datasets to compensate for the concept estimation incompleteness of the individual datasets and to reinforce the discriminating ability of the estimator [41]. It can also be extended to support a sharable environment such as cloud infrastructure [44].

6 Conclusion

In this paper, a novel contribution in the area of multi-label semantic video concept detection has been presented. It introduced a framework for multi-label video concept detection using a state-of-the-art asymmetrically trained fusion of convolutional neural networks and a foreground driven concept co-occurrence matrix, FDCCM. FDCCM keeps the co-occurrence data of each foreground concept with the other concepts, and has been built from the training dataset at the local level and from a collection of random images retrieved by Google Images at the global level. The approach is evaluated on the video concept detection task over the TRECVID dataset and compared with state-of-the-art methods using mean average precision. The experimental results show that the introduction of a novel classifier using a fusion of asymmetrically trained deep CNNs to deal with the dataset imbalance problem improves the video concept detection rate, as is evident from the improved MAP scores for various Top-d values. This work also proves that the use of the fusion of asymmetrically trained CNNs and FDCCM together improves overall concept detection. The results summarized in Table 10 show that the performance of the proposed approach is far better than the other existing techniques reported in the literature.

References

1. Feng L, Bhanu B (2016) Semantic concept co-occurrence patterns for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 38(2):785–799
2. Kuo CH, Chou YH, Chang PC (2016) Using deep convolutional neural networks for image retrieval. Soc Imag Sci Technol. https://doi.org/10.2352/ISSN.2470-1173.2016.2.VIPC-231
3. Podlesnaya A, Podlesnyy S (2016) Deep learning based semantic video indexing and retrieval. arXiv:1601.07754 [cs.IR]
4. McCormac J, Handa A, Davison A, Leutenegger S (2016) SemanticFusion: dense 3D semantic mapping with convolutional neural networks. arXiv:1609.05130v2 [cs.CV]
5. Kikuchi K, Ueki K, Ogawa T, Kobayashi T (2016) Video semantic indexing using object detection-derived features. In: Proc. 24th European signal processing conference (EUSIPCO). Budapest, pp 1288–1292
6. Awad G, Snoek CGM, Smeaton AF, Quénot G (2016) TRECVid semantic indexing of video: a 6-year retrospective. ITE Trans Med Technol Appl (MTA) 4(1):187–208
7. Janwe NJ, Bhoyar KK (2016) Neural network based multi-label semantic video concept detection using novel mixed-hybrid-fusion approach. In: Proceedings of the 2nd international conference on communication and information processing, ICCIP 2016. ACM, Singapore, pp 129–133
8. Vedaldi A, Lenc K (2015) MatConvNet: convolutional neural networks for MATLAB. In: Proc. of the int. conf. on multimedia. ACM, pp 689–692. https://doi.org/10.1145/2733373.2807412
9. Modiri S, Amir A, Zamir R, Shah M (2014) Video classification using semantic concept co-occurrences. https://doi.org/10.1109/CVPR.2014.324
10. Li X, Zhao F, Guo Y (2014) Multi-label image classification with a probabilistic label enhancement model. In: UAI'14 Proceedings of the thirtieth conference on uncertainty in artificial intelligence, pp 430–439
11. Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In: Proceedings of the international conference on machine learning, ICML. Beijing, pp 647–655
12. Zeiler MD, Fergus R (2013) Visualizing and understanding convolutional networks. arXiv:1311.2901 [cs.CV]
13. Memar S, Suriani AL (2013) An integrated semantic-based approach in concept based video retrieval. Multimed Tools Appl 64:77–95. https://doi.org/10.1007/s11042-011-0848-4
14. Oquab M, Bottou L, Laptev I, Sivic J (2013) Learning and transferring mid-level image representations using convolutional neural networks. Technical Report HAL-00911179, INRIA
15. Ma H, Zhu J, Lyu MRT, King I (2010) Bridging the semantic gap between image contents and tags. IEEE Trans Multimed 12(5):462–473
16. Jia D, Berg A, Fei-Fei L (2011) Hierarchical semantic indexing for large scale image retrieval. In: Proceedings of the 2011 IEEE conference on computer vision and pattern recognition, CVPR 2011. Colorado Springs, pp 785–792
17. Farhadi A, Endres I, Hoiem D, Forsyth D (2009) Describing objects by their attributes. In: 2009 IEEE Computer society conference on computer vision and pattern recognition workshops, CVPR Workshops. Miami, pp 1778–1785
N. J. Janwe and K. K. Bhoyar
18. Bobick A, Davis J (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23(1):257–267
19. Davis JW, Bobick AF (1997) The representation and recognition of action using temporal templates. In: Proc. IEEE international conference on computer vision and pattern recognition, pp 928–934
20. Zelnik ML, Irani M (2006) Statistical analysis of dynamic actions. IEEE Trans Pattern Anal Mach Intell 28(9):1530–1535
21. Dong X, Chang SF (2007) Visual event recognition in news video using kernel methods with multi-level temporal alignment. In: Proc. IEEE international conference on computer vision and pattern recognition. Minneapolis
22. Zhou X, Zhuang X, Yan S, Chang SF, Hasegawa-Johnson M, Huang TS (2008) SIFT-bag kernel for video event analysis. In: Proc. ACM international conference on multimedia. Vancouver, pp 229–238
23. Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. In: NIPS, pp 1–8
24. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(5):2278–2324
25. Dean J, Corrado G, Monga R, Chen K, Devin M, Le Q, Mao M, Ranzato M, Senior A, Tucker P, Yang K, Ng A (2012) Large scale distributed deep networks. In: NIPS, pp 1–9
26. Rumelhart D, Hinton G, Williams R (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
27. Torralba A, Murphy KP, Freeman WT (2004) Contextual models for object detection using boosted random fields. In: Proc. Adv. neural inf. process. syst., pp 1401–1408
28. Rabinovich A, Vedaldi A, Galleguillos C, Wiewiora E, Belongie S (2007) Objects in context. In: Proc. 11th IEEE int. conf. comput. vis., pp 1–8
29. Galleguillos C, Rabinovich A, Belongie S (2008) Object categorization using co-occurrence, location and appearance. In: Proc. IEEE Conf. comput. vis. pattern recog., pp 1–8
30. Hwang S, Grauman K (2010) Reading between the lines: object localization using implicit cues from image tags. In: Proc. IEEE Conf. comput. vis. pattern recog., pp 1145–1158
31. Torralba A (2003) Contextual priming for object detection. Int J Comput Vis 53(2):169–191
32. Divvala S, Hoiem D, Hays J, Efros A, Hebert M (2009) An empirical study of context in object detection. In: Proc. IEEE Conf. comput. vis. pattern recog., pp 1271–1278
33. Feng L, Bhanu B (2012) Semantic-visual concept relatedness and co-occurrences for image retrieval. In: ICIP, pp 2429–2432
34. Wang J, Zhao Y, Wu X, Hua XS (2011) A transductive multi-label learning approach for video concept detection. Pattern Recogn 44:2274–2286
35. Zha ZJ, Liu Y, Mei T, Hua XS (2007) Video concept detection using support vector machines – TRECVID 2007 evaluations. Technical report, Microsoft Research Lab – Asia
36. Mazloom M, Li X, Snoek CGM (2016) TagBook: a semantic video representation without supervision for event detection. IEEE Trans Multimed 18(7):1378–1388
37. Markatopoulou F, Mezaris V, Patras I (2015) Cascade of classifiers based on binary, non-binary and deep convolutional network descriptors for video concept detection. In: Proc. IEEE Int. conf. on image processing. Quebec City, pp 1786–1790
38. Markatopoulou F, Mezaris V, Patras I (2016) Deep multi-task learning with label correlation constraint for video concept detection. In: Proc. of the ACM multimedia conference. Amsterdam, pp 501–505
39. Sun Y, Sudo K, Taniguchi Y (2014) TRECVid 2013 semantic video concept detection by NTT-MD-DUT. In: Proc. of TRECVID 2014
40. Chen X, Chen S, Wu Y (2017) Coverless information hiding method based on the Chinese character encoding. J Int Technol 18(2):91–98. https://doi.org/10.6138/JIT.2017.18.2.20160815
41. Tian Q, Chen S (2017) Cross-heterogeneous-database age estimation through correlation representation learning. J Neurocomput 238:286–295
42. Xue Y, Jiang J, Zhao B, Ma T (2017) A self-adaptive artificial bee colony algorithm based on global best for global optimization. Soft Comput 1–18. https://doi.org/10.1007/s00500-017-2547-1
43. Yuan C, Xia Z, Sun X (2017) Coverless image steganography based on SIFT and BOF. J Int Technol 18(2):209–216
44. Wei W, Fan X, Song H, Fan X, Yang J (2016) Imperfect information dynamic stackelberg game based resource allocation using hidden Markov for cloud computing. IEEE Trans Services Comput (99). https://doi.org/10.1109/TSC.2016.2528246
45. Chen Y, Hao C, Wu W, Wu E (2016) Robust dense reconstruction by range merging based on confidence estimation. Sci Chin Inf Sci 59(9):1–11. https://doi.org/10.1007/s11432-015-0957-4
46. NIST: http://www.nist.gov
47. TRECVID: http://www-nlpir.nist.gov

Nitin J. Janwe has completed his B.E. from S.G.G.S. Institute of Technology, Nanded, Maharashtra, India. He did his M.Tech. (CSE) from RTM Nagpur University, Nagpur, India. Currently he is a PhD student at the Department of Information Technology, Yeshwantrao Chavan College of Engineering, Nagpur, Maharashtra, India. He is a member of CSI & IEEE. His areas of interest are Image & Video Processing and Computer Vision & Machine Learning. He is working as Associate Professor in the Department of Computer Technology, Rajiv Gandhi College of Engineering, Research & Technology, Chandrapur, Maharashtra, India.

Kishor K. Bhoyar is currently working as a Professor in the Department of Information Technology, Yeshwantrao Chavan College of Engineering, Nagpur, India. He has over 27 years of experience. He has been awarded a PhD degree in Computer Science and Engineering by Vishweshwarayya National Institute of Technology, Nagpur. He is a member of ACM, CSI, and IACSIT. His areas of interest are Image & Video Processing and Machine Learning.