


Text-to-Video: A Semantic Search Engine for Internet Videos


Lu Jiang · Shoou-I Yu · Deyu Meng ·
Teruko Mitamura · Alexander G. Hauptmann


Abstract  Semantic search or text-to-video search in video is a novel and challenging problem in information and multimedia retrieval. Existing solutions are mainly limited to text-to-text matching, in which the query words are matched against the user-generated metadata. This kind of text-to-text search, though simple, is of limited functionality as it provides no understanding about the video content. This paper presents a state-of-the-art system for event search without any user-generated metadata or example videos, known as text-to-video search. The system relies on substantial video content understanding and allows for searching complex events over a large collection of videos. The proposed text-to-video search can be used to augment the existing text-to-text search for video. The novelty and practicality are demonstrated by the evaluation in NIST TRECVID 2014, where the proposed system achieves the best performance. We share our observations and lessons in building such a state-of-the-art system, which may be instrumental in guiding the design of future systems for video search and analysis.

Keywords  Video Search · Video Understanding · Text-to-Video · Content-based Retrieval · Multimedia Event Detection · Zero-Example

Lu Jiang, Shoou-I Yu, Teruko Mitamura, Alexander G. Hauptmann
School of Computer Science, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, USA
E-mail: {lujiang, iyu, teruko, alex}@cs.cmu.edu

Deyu Meng
School of Mathematics and Statistics, Xi'an Jiaotong University
No.28, Xianning West Road, Xi'an, China
E-mail: dymeng@mail.xjtu.edu.cn

1 Introduction

The explosion of multimedia data is creating impacts on many aspects of society. The huge volumes of accumulated video data bring challenges for effective multimedia search. Existing solutions, such as shown in YouTube, are mainly limited to text matching, where the query words are matched against the user-generated metadata provided by the uploader [7]. This solution, though simple, proves to be futile when such metadata are either missing or less relevant to the video content. Content-based search, on the other hand, searches semantic features such as people, scenes, objects and actions that are automatically detected in the video content. A representative content-based retrieval task, initiated by the TRECVID community, is called Multimedia Event Detection (MED) [36]. The task is to detect the occurrence of a main event in a video clip without using any user-generated metadata. The events of interest are mostly daily activities ranging from "birthday party" to "changing a vehicle tire".

Fig. 1 Comparison of text and semantic query for the event "birthday party". Semantic queries contain visual/audio concepts about the video content.

This paper discusses a setting in MED called 0Ex search [18], in which no relevant video is given in the query. 0Ex provides a text-to-video search scheme which relies on substantial video content understanding as opposed to shallow text matching in conventional methods. The query words are semantic features that are expected to occur in the relevant videos. For example, as illustrated in Fig. 1, the semantic query for the event "birthday party" might consist of the visual concepts "cake", "gift" and "kids", and the audio concepts "birthday song" and "cheering sound". 0Ex also allows for flexible semantic search such as temporal or Boolean logic search, for example, searching for videos where opening gifts happens before consuming birthday cakes.

This paper details a state-of-the-art 0Ex system called E-Lamp Lite, and according to the National Institute of Standards and Technology (NIST), the proposed system achieves the best performance in TRECVID 2014 on a large collection of 200,000 Internet videos [36]. TRECVID is funded by NIST with support from other US government agencies, and its goal is to promote progress via open, metrics-based evaluation [36]. The yearly evaluation attracts worldwide participants from both academia and industry. The evaluation in 2014 was conducted on about 200,000 Internet videos. The evaluation is rigorous because: 1) each system can only submit a single run; 2) each run has to be generated within 1 hour after getting the query; 3) the ground-truth data on the MED14Eval dataset is never released even after the evaluation; 4) some queries are generated online and unknown to the system beforehand. Because of the data size and the rigorous evaluation protocol, its result should reflect a system's performance on real-world problems. Our performance is about three times that of the second best system. The outstanding performance is attributed to our rational pipeline as well as its effective components. We share our observations and lessons in building such a state-of-the-art system. The lessons are valuable because of not only the effort in designing and conducting numerous experiments but also the considerable computational resources needed to make the experiments possible. For example, building the semantic detectors cost us more than 1.2 million CPU core hours, which is equivalent to 140 years if running on a single core. We believe the shared lessons may significantly save time and computational cycles for others who are interested in this problem.

The outstanding performance evaluated by NIST is a convincing demonstration of the novelty and practicality of the proposed system. Specifically, the novelty of this paper includes the solutions in system design and the insights from a number of empirical studies on semantic video search. The discussed techniques may also benefit other related tasks such as video summarization and recommendation. In summary, the contribution of this paper is twofold:

– We share our observations and lessons in building a state-of-the-art 0Ex event search system.
– Our pilot studies provide compelling insights on the comparison of modality contributions, semantic mapping methods and retrieval models for event search.

2 Related Work

Multimedia event detection is an interesting problem. A number of studies have been proposed to tackle this problem using several training examples (typically 10 or 100 examples) [14,9,43,11,35,19,39,3,41,49]. Generally, in a state-of-the-art system, the event classifiers are trained on low-level and high-level features, and the final decision is derived from the fusion of the individual classification results. For example, Habibian et al. [11] reported several interesting observations about training classifiers only on semantic concept features. Gkalelis et al. [9] learned a representation for linear SVMs by subclass discriminant analysis, which yields 1-2 orders of magnitude speed-up. Wang et al. [43] discussed a notable system in TRECVID 2012 that is characterized by applying feature selection over so-called motion relativity features. Oh et al. [35] presented a latent SVM event detector that enables temporal evidence localization. Jiang et al. [19] presented an efficient method to learn "optimal" spatial event representations from data.

Event detection with zero training examples is called 0Ex. It resembles a real-world video search scenario, where users usually start the search without any example video. 0Ex is an understudied problem, and only a few studies have been proposed very recently [6,10,28,45,18,25,21,20]. Dalton et al. [6] discussed a query expansion approach for concept and text retrieval. Habibian et al. [10] proposed to index videos by composite concepts that are trained by combining the labeled data of individual concepts. Wu et al. [45] introduced a multimodal fusion method for semantic concepts and text features. Given a set of tagged videos, Mazloom et al. [28] discussed a retrieval approach to propagate the tags to unlabeled videos for event detection. Jiang et al. [18,15] studied pseudo relevance feedback approaches which manage to significantly improve the original retrieval results. Existing related works inspire our system. However, to the best of our knowledge, there have been no studies on the 0Ex system architecture nor the analysis of each component.

Another related research topic is called zero-shot learning, the goal of which is to learn a classifier predicting novel classes that were omitted from the training set [37]. It is essentially a multi-class classification problem, where the number of training examples for the unseen class is zero. Generally, zero-shot learning and zero-example search are related, and some methods, such as the word-to-vector embedding evaluated in this paper, can be used in both problems. Therefore, some of the techniques presented in this paper can also be used in zero-shot learning.

However, there exist a number of differences between the two problems. First, zero-shot learning is set in a multi-class classification setting, i.e. to learn a classifier that can separate seen and unseen classes. In contrast, zero-example search is a detection task, i.e. to find relevant videos among background videos which might not belong to any class. Second, the granularity of the class definition is different. The classes in zero-shot learning are usually visual objects [34]. Some objects may not have any training data because it is hard to collect, and thus, for example, we may use the training examples of "dog" and "horse" to classify "cat". Still, it is possible to collect training data for the unseen classes as the number of visual classes is limited. However, the "class" (query) in zero-example search is usually a complex event that contains a number of objects, scenes, people, etc., e.g. we may use "street", "crowd", "people marching", "flag", and "walking" to detect the event "parade". It is impossible to obtain training data for every "class" because the user information need can always become more specific, e.g. from "parade" to "outdoor parade"; from "outdoor parade in an urban area" to "outdoor parade in an urban area where we see a little girl crying". Third, the complexity of the class is different. The class definition in zero-example search is more complex as it can contain logical and temporal relations among people, objects, actions, etc. For example, the user may formulate a query like "a baby stopped crying AFTER he saw a dog OR a cat". Note the user explicitly utilizes the Boolean logic operator "OR" and the temporal operator "AFTER" in this complex query.

3 Framework

Semantic search in video can be modeled as a typical retrieval problem in which, given a user query, we are interested in returning a ranked list of relevant videos. The proposed system comprises four major components, namely Video Semantic INdexing (VSIN), Semantic Query Generation (SQG), Multimodal Search, and Pseudo-Relevance Feedback (PRF)/Fusion, where VSIN is an offline indexing component, and the rest are online search modules.

The VSIN component extracts semantic features from input videos, and indexes them for efficient online search. Typically, a video clip is first represented by low-level features such as dense trajectory features [44] for the visual modality or deep learning features [30,31] for the audio modality. The low-level features are then input into off-the-shelf detectors to extract the high-level features. Each dimension of the high-level feature corresponds to a confidence score of detecting a semantic concept in the video [14,13]. Compared with low-level features, high-level features have a much lower dimension, which makes them economical for both storage and computation. The visual/audio concepts, Automatic Speech Recognition (ASR) [30] and Optical Character Recognition (OCR) are the four types of high-level features in the system, in which ASR and OCR are textual features. ASR provides complementary information for events that are characterized by acoustic evidence. It especially benefits close-to-camera and narrative events such as "town hall meeting" and "asking for directions". OCR captures the characters in videos with low recall but high precision. The recognized characters are often not meaningful words but can sometimes be a clue for fine-grained detection, e.g. distinguishing "baby shower" and "wedding shower". The union of the dictionary vocabularies of the high-level features constitutes the system vocabulary.

Users can express a query in various forms, such as a few concept names, a sentence or a structured description. NIST provides a query in the form of event-kit descriptions, which include a name, definition, explication and visual/acoustic evidences (see the left corner of Fig. 2). The SQG component translates a user query into a multimodal system query, all words of which exist in the system vocabulary. Since the vocabulary is usually limited, addressing the out-of-vocabulary issue is a major challenge for SQG. The mapping between the user and system query is usually achieved with the aid of an ontology such as WordNet and Wikipedia. For example, a user query "golden retriever" may be translated to its most relevant alternative "large-sized dog", as the original concept may not exist in the system vocabulary.

Given a system query, the multimodal search component aims at retrieving a ranked list for each modality. As a pilot study, we are interested in leveraging the well-studied text retrieval models for video retrieval. To adapt to the difference between semantic features and text features, we empirically study a number of classical retrieval models on various types of semantic features. We then apply each model to its most appropriate modalities. One notable benefit of doing so is that it can easily leverage the existing infrastructures and algorithms originally designed for text retrieval.
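To make the data flow of the framework concrete, the sketch below shows one plausible way (a hypothetical illustration, not the authors' actual code) to represent an indexed video — dense concept confidence scores plus ASR/OCR word bags — together with a weighted multimodal system query.

```python
from dataclasses import dataclass, field
from collections import Counter
from typing import Dict

@dataclass
class IndexedVideo:
    """High-level representation produced by the VSIN component (sketch).

    Concept scores are dense (every concept has a confidence), while ASR/OCR
    are sparse bags of words kept in an inverted index in practice.
    """
    video_id: str
    concept_scores: Dict[str, float] = field(default_factory=dict)  # e.g. {"dog": 0.83, "cake": 0.02}
    asr_words: Counter = field(default_factory=Counter)             # decoded speech tokens
    ocr_words: Counter = field(default_factory=Counter)             # recognized character strings

@dataclass
class SystemQuery:
    """Multimodal system query produced by SQG: weighted concept names plus
    plain-text queries for the ASR and OCR modalities."""
    concept_weights: Dict[str, float]   # e.g. {"cake": 2.0, "gift": 1.0, "kids": 0.5}
    asr_query: str
    ocr_query: str

# A toy "birthday party" query mirroring Fig. 1 (illustrative values only).
query = SystemQuery(
    concept_weights={"cake": 2.0, "gift": 1.0, "kids": 1.0, "birthday_song": 1.0},
    asr_query="happy birthday",
    ocr_query="birthday",
)
```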

Fig. 2 Overview of the E-Lamp Lite video search system.

PRF (also known as reranking) refines a ranked list by reranking its videos. A generic PRF method first selects a few feedback videos, and assigns assumed positive or negative labels to them. Since no ground-truth label is used, the assumed labels are called pseudo labels. The pseudo samples are then used to build a reranking model to improve the original ranked lists. A recent study shows that reranking can be modeled as a self-paced learning process [15], where the reranking model is built iteratively from easy to more complex samples. The easy samples are the videos ranked at the top, which are generally more relevant than those ranked lower. In addition to PRF, standard late fusion is applied in our system.

4 System Implementations

The devil is in the detail. Careful implementations often turn out to be the cornerstone of many systems. To this end, this section discusses each component in more detail.

4.1 Large-scale Semantic Indexing

Concept detectors can be trained on still images or videos. The latter is more desirable due to the minimal domain difference and the capability for action and audio detection. A visual/audio concept in our system is represented as a multimodal document which includes a name, description, category, reliability (accuracy) and examples of the top detected video snippets. This definition provides a more tangible understanding about the concept for users.

The quantity (relevance) and quality of the semantic detectors are two crucial factors affecting performance. The relevance is measured by the coverage of the concept vocabulary with respect to the query, and thus is query-dependent. For convenience, we name it quantity, as a larger vocabulary tends to increase the coverage. Quality is evaluated by the accuracy of the detector. Given limited resources, there exists a trade-off between quality and quantity, i.e. building many unreliable detectors versus building a few reliable detectors. This tradeoff has been studied when there are several training examples [12,11]. This paper presents a novel understanding of the tradeoff when there is no training example. The observations suggest that training more reasonably accurate detectors tends to be a sensible strategy.

Training detectors on large-scale video datasets increases both quantity and quality, but it turns out to be quite challenging. We approach this challenging problem with effort in two aspects. On the theoretical side, a novel and effective method named self-paced curriculum learning [17] is explored. The idea is that, as opposed to training a detector using all samples at once, we train increasingly complex detectors iteratively on balanced data subsets. The scheme of selecting samples to be trained in each iteration is controlled by a regularizer [24,16,17], and can be conveniently replaced to fit various problems. On the practical side, the module is optimized by storing kernel matrices in large shared-memory machines. This strategy yields an 8-times speedup in training, enabling us to train 3 thousand detectors over around 2 million videos (and video segments). In practice, in order to index huge volumes of video data, the detectors need to be linear models (linear SVM or logistic regression), and nonlinear models have to first be transformed to linear models (e.g. by explicit feature mapping [42]).
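As an illustration of this last point, the following sketch approximates a nonlinear additive-kernel SVM with an explicit feature map followed by a linear SVM, which keeps both training and indexing linear. The tooling choice (scikit-learn) and the toy data are assumptions; the paper only cites explicit feature maps [42] without prescribing an implementation.

```python
# Sketch: replace a nonlinear additive-kernel SVM by an explicit feature map + linear SVM.
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.abs(rng.random((200, 50)))   # toy bag-of-visual-words histograms (non-negative)
y = rng.integers(0, 2, size=200)    # toy concept labels

detector = make_pipeline(
    AdditiveChi2Sampler(sample_steps=2),  # explicit approximation of the additive chi^2 kernel
    LinearSVC(C=1.0),                     # linear model: cheap to apply to millions of videos
)
detector.fit(X, y)

# At indexing time, the decision value would be stored as the concept's confidence score.
scores = detector.decision_function(np.abs(rng.random((5, 50))))
print(scores)
```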

The ASR module is built on Kaldi [38] by training the HMM/GMM acoustic model with speaker adaptive training [30,32,29] on videos. The trigram language model is pruned aggressively to speed up decoding. OCR is extracted by a commercial toolkit. The high-level features are indexed. The confidence scores for ASR and OCR are discarded, and the words are indexed by a standard inverted index. The visual/audio concepts are indexed by dense matrices to preserve their detection scores.

4.2 Semantic Query Generation

SQG translates a user query into a multimodal system query which only contains words in the system vocabulary. The input user query is given in the form of the event-kit description by the National Institute of Standards and Technology (NIST). Table 1 shows the user query (event-kit description) for the event "E011 Making a sandwich". Its corresponding system query (with manual inspection) is shown in Table 2. Indeed, as we see, SQG is very challenging because it involves understanding a description written in natural language.

The first step in SQG is to parse negations in the query to recognize counter-examples. The recognized examples can either be discarded or be associated with a "NOT" operator in the system query. Given that TRECVID provides user queries in the form of event-kit descriptions (see the left corner of Fig. 2), an event can be represented by the event name (1-3 words) or the frequent words in the event-kit description (after removing the template and stop words). These representations can be directly used as system queries for ASR/OCR, as their vocabularies are sufficiently large to cover most words. For visual/audio concepts, the representations are used to map the out-of-vocabulary query words to their most relevant concepts in the system vocabulary. This mapping in SQG is challenging because of the complex relations between concepts. The relations between concepts include mutual exclusion, subsumption, and frequent co-occurrence. For example, "cloud" and "sky" are frequently co-occurring concepts; "dog" subsumes "terrier"; and "blank frame" excludes "dog". Our system includes the following mapping algorithms to map a word in the user query to a concept in the system vocabulary:

Exact word matching: A straightforward mapping is matching the exact query word (usually after stemming) against the concept name or description. Generally, for unambiguous words, it has high precision but low recall.

WordNet mapping: This mapping calculates the similarity between two words in terms of their distance in the WordNet taxonomy. The distance can be defined in various ways, such as structural depths in the hierarchy [46] or shared overlaps between synonymous words [1]. WordNet mapping is good at capturing synonyms and subsumption relations between two nouns.

PMI mapping: This mapping calculates the Pointwise Mutual Information (PMI) [5] between two words. Suppose q_i and q_j are two words in a user query; we have:

pmi(q_i; q_j) = \log \frac{P(q_i, q_j \mid C_{ont})}{P(q_i \mid C_{ont}) \, P(q_j \mid C_{ont})},    (1)

where P(q_i \mid C_{ont}) and P(q_j \mid C_{ont}) represent the probability of observing q_i and q_j in the ontology C_{ont} (e.g. a collection of Wikipedia articles), which is calculated by the fraction of documents containing the word. P(q_i, q_j \mid C_{ont}) is the probability of observing documents in which q_i and q_j both occur. PMI mapping assumes that similar words tend to co-occur more frequently, and is good at capturing frequently co-occurring concepts (both nouns and verbs).

Word embedding mapping: This mapping learns a word embedding that helps predict the surrounding words in a sentence [33,26]. The learned embedding, usually trained by neural network models, lies in a lower-dimensional vector space, and the cosine coefficient between two words is often used to measure their distance. It is fast and also able to capture frequently co-occurring words in similar contexts.

We found two interesting observations in semantic query generation. First, discriminating concept relevance in a query tends to increase the performance. In our system, the relevance is categorized into three levels, "very relevant", "relevant" and "slightly relevant", which are assigned weights of 2.0, 1.0 and 0.5, respectively. See an example in Table 2. Table 3 compares our full system with and without query weighting. As we see, the system with query weighting outperforms the one without it.

Table 3 Analysis of weighted and logic queries.

Runs          | MED13Test: Single, 10 Splits | MED14Test: Single, 10 Splits
Query Weight  | 20.60, 18.77±2.16            | 20.75, 19.47±1.19
NoWeight      | 18.8, 18.30±2.20             | 20.27, 19.35±1.26
E029 Logic    | 24.50, 18.47±3.14            | 24.50, 18.47±3.14
E029 NoLogic  | 12.57, 11.89±2.05            | 12.57, 11.89±2.05

Second, the Boolean logic query tends to improve the performance for certain queries. For example, in the event "E029: Winning a race without a vehicle", the query including only relevant concepts such as swimming, racing or marathon can achieve a MAP of 12.57. However, the query also containing "AND NOT" concepts such as car racing, horse riding or bicycling can achieve a MAP of 24.50. This observation suggests that formulating logic queries is a promising direction, under the circumstance that the concept detectors are reasonably accurate.
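To make the PMI mapping in Eq. (1) concrete, here is a minimal sketch that scores query-word/concept pairs from document-frequency counts over an ontology corpus. The toy counts and the add-one smoothing are illustrative assumptions, not the system's actual Wikipedia statistics.

```python
import math

# Sketch of the PMI mapping in Eq. (1): score how strongly a query word and a
# concept name co-occur in an ontology corpus (e.g. Wikipedia). The counts below
# are toy numbers; a real system would read them from an inverted index.
N_DOCS = 1_000_000                      # documents in the ontology C_ont
df = {"birthday": 12_000, "cake": 30_000, "parade": 8_000}          # document frequency
df_joint = {("birthday", "cake"): 4_000, ("birthday", "parade"): 150}

def pmi(w1: str, w2: str) -> float:
    """pmi(w1; w2) = log P(w1, w2 | C_ont) / (P(w1 | C_ont) * P(w2 | C_ont))."""
    p1 = df[w1] / N_DOCS
    p2 = df[w2] / N_DOCS
    joint = df_joint.get((w1, w2), df_joint.get((w2, w1), 0))
    p12 = (joint + 1) / N_DOCS          # add-one smoothing so unseen pairs stay finite (assumption)
    return math.log(p12 / (p1 * p2))

# Rank candidate concepts for the query word "birthday".
candidates = ["cake", "parade"]
print(sorted(candidates, key=lambda c: pmi("birthday", c), reverse=True))
```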

4.3 Multimodal Search

Given a system query, the multimodal search component aims at retrieving a ranked list for each modality. We are interested in leveraging the well-studied text retrieval models for video retrieval. There is no single retrieval model that works the best for all modalities. As a result, our system incorporates several classical retrieval models and applies them to their most appropriate modalities. Let Q = q_1, ..., q_n denote a system query. A retrieval model ranks videos by the score s(d|Q), where d is a video in the video collection C. Our system includes the following retrieval models:

Vector Space Model (VSM): This model represents both a video and a query as a vector of the words in the system vocabulary. The common vector representations include generic term frequency (tf) and term frequency-inverse document frequency (tf-idf) [47]. s(d|Q) derives from either the product or the cosine coefficient between the video and the query vector.

Okapi BM25: This model extends the tf-idf representation:

s(d|Q) = \sum_{i=1}^{n} \log \frac{|C| - df(q_i) + \frac{1}{2}}{df(q_i) + \frac{1}{2}} \cdot \frac{tf(q_i, d)\,(k_1 + 1)}{tf(q_i, d) + k_1 \left(1 - b + b \frac{len(d)}{\overline{len}}\right)},    (2)

where |C| is the total number of videos; df(·) returns the document frequency for a given word in the collection; tf(q_i, d) returns the raw term frequency of the word q_i in the video d; len(d) calculates the sum of concept or word detection scores in the video d, and \overline{len} is the average video length in the collection. k_1 and b are two parameters to tune [27]. In the experiments, we set b = 0.75, and tune k_1 in [1.2, 2.0].

Language Model-JM Smoothing (LM-JM): The score is considered to be generated by a unigram language model [51]:

s(d|Q) = \log P(d|Q) \propto \log P(d) + \sum_{i=1}^{n} \log P(q_i \mid d),    (3)

where P(d) is usually assumed to follow the uniform distribution, i.e. the same for every video, and P(q_i \mid d) equals:

P(q_i \mid d) = \lambda \frac{tf(q_i, d)}{\sum_{w} tf(w, d)} + (1 - \lambda) P(q_i \mid C),    (4)

where w enumerates all words or concepts in a given video, and P(q_i \mid C) is called a smoother that can be calculated by df(q_i)/|C|. Eq. (4) linearly interpolates the maximum likelihood estimate (first term) with the collection model (second term) by a coefficient λ. The parameter is usually tuned in the range of [0.7, 0.9]. This model is good for retrieving long text queries, e.g. the frequent words in the event-kit description.

Language Model-Dirichlet Smoothing (LM-DL): This model adds a conjugate prior to the language model:

P(q_i \mid d) = \frac{tf(q_i, d) + \mu P(q_i \mid C)}{\sum_{w} tf(w, d) + \mu},    (5)

where μ is the coefficient balancing the likelihood model and the conjugate prior, and is usually tuned in [0, 2000] [51]. This model is good for short text queries, e.g. the event name representation.
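As a concrete reference, the sketch below scores toy "videos" under BM25 (Eq. 2) and LM-JM (Eqs. 3-4). The bag-of-terms documents and collection statistics are illustrative assumptions rather than the system's indexes.

```python
import math

# Toy collection: each "video" is a bag of term weights (word counts for ASR/OCR,
# or detection scores for semantic concepts). Values are illustrative only.
videos = {
    "v1": {"cake": 3.0, "kids": 1.0},
    "v2": {"car": 2.0, "racing": 4.0},
    "v3": {"cake": 1.0, "gift": 2.0},
    "v4": {"dog": 2.0, "park": 1.0},
    "v5": {"speech": 5.0},
}
C = len(videos)
vocab = {t for d in videos.values() for t in d}
df = {t: sum(1 for d in videos.values() if t in d) for t in vocab}
avg_len = sum(sum(d.values()) for d in videos.values()) / C

def bm25(query, d, k1=1.2, b=0.75):
    """Eq. (2): sum of IDF times a saturated, length-normalized term frequency."""
    score, length = 0.0, sum(d.values())
    for q in query:
        tf = d.get(q, 0.0)
        idf = math.log((C - df.get(q, 0) + 0.5) / (df.get(q, 0) + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
    return score

def lm_jm(query, d, lam=0.8):
    """Eqs. (3)-(4): unigram language model with Jelinek-Mercer smoothing."""
    length = sum(d.values())
    score = 0.0
    for q in query:
        p_ml = d.get(q, 0.0) / length
        p_c = df.get(q, 0) / C              # collection "smoother" P(q|C) = df(q)/|C|
        score += math.log(lam * p_ml + (1 - lam) * p_c + 1e-12)  # epsilon avoids log(0)
    return score

query = ["cake", "kids"]
for vid, d in videos.items():
    print(vid, round(bm25(query, d), 3), round(lm_jm(query, d), 3))
```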

4.4 Pseudo Relevance Feedback

PRF (or reranking) is a cost-effective method for improving performance. Studies have shown that it might hurt a few queries but generally improves the overall performance across all queries. We incorporate a general multimodal PRF method called SPaR [15] in our system.

Self-Paced Reranking (SPaR): This method is inspired by the cognitive and learning process of humans and animals, which gradually learn from easy to more complex samples [2,24,17,52]. The easy samples in this problem are the videos ranked at the top, which are generally more relevant than those ranked lower. SPaR can take as input either a single fused ranked list or a number of ranked lists, one from each of the modalities (when different retrieval models are used for different features, and their ranked lists have very different score distributions, we may need to solve a linear programming problem to determine the starting pseudo positives [18]). For convenience of notation, we mainly discuss the case of a single ranked list. Let Θ denote the reranking model, x_i the feature of the i-th video and y_i the i-th pseudo label. Note that x_i could be both high-level and low-level features [18]. We have:

\min_{\Theta, \mathbf{y}, \mathbf{v}} \sum_{i=1}^{n} v_i L(x_i, y_i; \Theta) + f(\mathbf{v}; \lambda), \quad \text{s.t. constraints on } \Theta, \; \mathbf{y} \in \{-1, +1\}^n, \; \mathbf{v} \in [0, 1]^n,    (6)

where v = [v_1, ..., v_n] are latent weight variables for each sample; L is the loss function of Θ; f is the self-paced function which determines the learning scheme, i.e. how to select and weight samples; and λ is a parameter controlling the model age. When λ is small, a few easy samples with smaller loss will be considered. As λ grows, more samples with larger loss will be gradually appended to train a mature model.

To run PRF in practice, we first need to pick a reranking model Θ (e.g. an SVM or regression model), a self-paced function (e.g. binary, linear or mixture weighting) [15,17], and reasonable starting values for the pseudo labels. The starting values can be initialized either by the top-ranked videos in the retrieved ranked lists or by other PRF methods. After the initialization, we iterate the following three steps: 1) training a model based on the selected pseudo samples and their weights (fixing v, y, optimizing Θ); 2) calculating pseudo positive samples and their weights by the self-paced function f, and selecting some pseudo negative samples randomly (fixing Θ, optimizing v, y; pseudo negative samples can be selected randomly because they have negligible impact on performance [18]); the weights of pseudo positive samples may be directly calculated by the closed-form solutions of f [15]; 3) increasing the model age to include more positive samples in the next iteration (increasing λ). The value of λ is increased not by an absolute value but by the number of positive samples to be included. For example, if 5 samples need to be included, we can set λ to the 6th smallest loss so that the samples whose loss is greater than λ will have 0 weight. In our system, MMPRF [18] and SPaR [15] are incorporated, in which MMPRF is used to assign the starting values, and SPaR is used as the core algorithm. Average fusion of the PRF result with the original ranked list is used to obtain better results.

PRF is a cost-effective method for improving the performance of event search. We observed that the precision of the pseudo positives determines the performance of PRF (see Fig. 5 in [18]). Therefore, high-precision features such as OCR and ASR often turn out to be very useful. In addition, soft-weighting self-paced functions that discriminate samples at different ranked positions tend to be better than binary weighting functions [15]. In this case, the model should support sample weighting (e.g. the LIBSVM tools-weights extension [4]). We observed two scenarios where the discussed reranking method could fail. First, SPaR may not help when the accuracy of the starting pseudo positive samples is below some threshold (e.g. the precision of the pseudo positives is less than 0.1). This may be due to less relevant queries or poor quality of the high-level features. In this case, SPaR may not be useful. Second, SPaR may not help when the features used in PRF are not discriminative.

5 Experiments

5.1 Setups

Dataset and evaluation: The experiments are conducted on the TRECVID Multimedia Event Detection (MED) sets MED13Test and MED14Test, evaluated by the official metric Mean Average Precision (MAP). Each set includes 20 events and 25,000 testing videos. The official test split released by NIST is used, and the reported MAP is comparable with others on the same split. The experiments are conducted without using any examples. We also evaluate each experiment on 10 randomly generated splits to reduce the bias brought by the split partition. The mean and 90% confidence interval are reported. Besides, the official results on MED14Eval evaluated by NIST TRECVID are also reported in the performance overview.

Features and queries: High-level features are Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), and visual semantic concepts. ImageNet features are trained on still images by deep convolutional neural networks [23]. The rest are directly trained on videos by the SVM-based self-paced learning pipeline [16,17]. The video datasets include: Sports [22], Yahoo Flickr Creative Common (YFCC) [40], Internet Archive Creative Common (IACC) [36] and Do It Yourself (DIY) [48]. The details of these datasets can be found in Table 4. In total, 3,043 video-based concept detectors are trained using the improved dense trajectory features [44], which costs more than 1.2 million CPU core hours (1,024 cores for about 2 months) at the Pittsburgh Supercomputing Center. Once trained, the detection (semantic indexing) for test videos is very fast. The features are available at http://www.cs.cmu.edu/~lujiang/0Ex/icmr15.html.

Two types of low-level features are used in the PRF model: dense trajectories [44] and MFCCs [50]. The input user query is the event-kit description. The system query is obtained by a two-step procedure: a preliminary mapping is automatically generated by the discussed mapping algorithms; the results are then examined by human experts to determine the final system query. See Table 1 and Table 2 for an example of user and system queries. For ASR/OCR, the automatically generated event name and description representations are used as the system query. Note that manual query examination is allowed in TRECVID and is used by many teams [36]. The automatically generated queries will be discussed in Section 5.5.
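Since MAP is the metric used throughout the experiments, a minimal sketch of average precision over a ranked list follows (standard AP over binary relevance; an illustration, not NIST's official scoring tool).

```python
def average_precision(ranked_video_ids, relevant_ids):
    """Standard AP for one query: mean of precision@k at each relevant hit."""
    hits, precisions = 0, []
    for k, vid in enumerate(ranked_video_ids, start=1):
        if vid in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(len(relevant_ids), 1)

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per event/query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example with two queries.
print(mean_average_precision([
    (["v3", "v1", "v2"], {"v1", "v3"}),
    (["v2", "v1", "v3"], {"v3"}),
]))
```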

System configurations: In the multimodal search component, the LM-JM model (λ = 0.7) is used for ASR/OCR with the frequent-words query from the event-kit description. BM25 is used for the ASR [32] and OCR features with the event name query (1-3 words), where k_1 = 1.2 and b = 0.75. Both the frequent-words query and the event name query are automatically generated without manual inspection. While parsing the frequent words in the event-kit description, the stop and template words are first removed, and words in the evidence section are counted three times. After parsing, the words with frequency ≥ 3 are then used in the query. The VSM-tf model is applied to all semantic concept features.

In the SQG component, the exact word matching algorithm finds the concept name in the frequent event-kit words (frequency ≥ 3). The WordNet mapping uses the distance metric in [46] as the default metric. We build an inverted index over the Wikipedia corpus (about 6 million articles), and use it to calculate the PMI mapping. A pre-trained word embedding trained on Wikipedia [26] is used to calculate the word embedding mapping.

In the PRF component, the SVM model is selected as the reranking model. The self-paced function used is the mixture weighting:

f(\mathbf{v}; k_1, k_2) = -\zeta \sum_{i=1}^{n} \log(v_i + \zeta k_1),    (7)

where \zeta = \frac{1}{k_2 - k_1} and k_2 > k_1 > 0. Its closed-form optimal solution is then written as:

v_i^* = \begin{cases} 1 & \ell_i \le 1/k_2 \\ \zeta/\ell_i - k_1 \zeta & 1/k_2 < \ell_i < 1/k_1 \\ 0 & \ell_i \ge 1/k_1 \end{cases}    (8)

where ℓ_i is the loss for the i-th sample in the fusion of the ranked lists from multiple modalities. If we rank a list by the sample loss in increasing order, 1/k_1 is set using the loss of the top 6th sample, and 1/k_2 is set as the loss of the top 3rd sample. In other words, the top 1-3 samples in the list will have weight 1.0 because their loss ℓ_i ≤ 1/k_2; the top 3-5 samples will have weight ζ/ℓ_i − k_1ζ; the top 6th sample will have weight 0 (as its loss ℓ_i = 1/k_1), and so do those ranked after it. In summary, only the top 5 samples have non-zero weights in PRF, and they are used as pseudo positive samples. As discovered in [18], the pseudo negative samples have negligible impact on performance. Therefore, we randomly select a number of pseudo negative samples that is proportional to the number of selected pseudo positive samples in each iteration.

When calculating the hinge loss in selecting the pseudo positive samples, it might be the case that more than 5 samples have loss 0. This happens because the model is not well calibrated. In this case, we can calculate the loss using another loss function that is an upper bound of the hinge loss. While minimizing the upper bound, we are also minimizing the hinge loss. Another practical approach to address this issue is to tune the parameter b in the loss function so that only one sample has loss 0. Note that changing the interpolation parameter b changes the constant added to the prediction score of each sample, and does not change the ranking of the prediction scores.

The PRF results of the improved dense trajectory and MFCC features are first averaged. Then the result is averaged with the original ranked list. Since the test set only has around 20 relevant videos in MED14Test, only a single iteration of PRF is conducted. For the MED14Eval dataset (200K videos) in our TRECVID 2014 submission, we conduct two iterations of PRF. Generally, we found that the performance starts getting worse after three iterations [15,18].
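To illustrate Eqs. (7)-(8) and the surrounding procedure, here is a minimal sketch of one reranking pass with mixture weighting. The scikit-learn SVM, the surrogate loss derived from the original retrieval scores, and the fusion step are illustrative assumptions rather than the exact E-Lamp implementation.

```python
import numpy as np
from sklearn.svm import SVC

def mixture_weights(losses, k1, k2):
    """Closed-form solution of Eq. (8) for the mixture-weighting self-paced function."""
    zeta = 1.0 / (k2 - k1)
    w = np.zeros_like(losses)
    w[losses <= 1.0 / k2] = 1.0
    mid = (losses > 1.0 / k2) & (losses < 1.0 / k1)
    w[mid] = zeta / losses[mid] - k1 * zeta
    return w  # samples with loss >= 1/k1 keep weight 0

def prf_iteration(X, original_scores, n_pos=5, n_neg_ratio=10, rng=np.random.default_rng(0)):
    """One SPaR-style PRF pass: pick pseudo labels, weight them, refit, fuse (sketch)."""
    order = np.argsort(-original_scores)            # rank videos by the original retrieval score
    pos = order[:n_pos]                             # top-ranked videos become pseudo positives
    neg = rng.choice(order[n_pos:], size=n_pos * n_neg_ratio, replace=False)  # random pseudo negatives

    # Surrogate loss: 1 - normalized score; then 1/k2 = loss of the 3rd video,
    # 1/k1 = loss of the 6th video, as described in the text (simplifying assumption).
    loss = 1.0 - (original_scores - original_scores.min()) / (np.ptp(original_scores) + 1e-12)
    k2 = 1.0 / loss[order[2]]
    k1 = 1.0 / loss[order[5]]
    pos_w = mixture_weights(loss[pos], k1, k2)

    idx = np.concatenate([pos, neg])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    sw = np.concatenate([pos_w, np.ones(len(neg))])
    model = SVC(kernel="linear").fit(X[idx], y, sample_weight=sw)

    # Average-fuse the reranked scores with the original ranked list (both min-max normalized).
    reranked = model.decision_function(X)
    reranked = (reranked - reranked.min()) / (np.ptp(reranked) + 1e-12)
    original = (original_scores - original_scores.min()) / (np.ptp(original_scores) + 1e-12)
    return 0.5 * (reranked + original)

X = np.random.default_rng(1).random((200, 32))      # toy low-level features
scores = np.random.default_rng(2).random(200)       # toy original ranked-list scores
print(prf_iteration(X, scores)[:5])
```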

5.2 Performance Overview

We first examine the overall MAP (×100) of the full system. Table 5 lists the MAP and the runtime (ms/query; the search time is evaluated on an Intel Xeon 2.53GHz CPU and does not include the PRF time, which is about 238 ms over 200K videos) across six datasets. The average MAP and the 90% confidence interval on the 10 splits are reported for MED13Test and MED14Test. The ground-truth data on MED14Eval has never been released, and thus the MAP evaluated by NIST TRECVID is only available on the single split. We observed that the results on a single split can be deceiving, as having no training data makes the result prone to overfitting. Therefore, we also evaluate the MAP on the ten splits. We consider an improvement statistically significant if it can be observed in both cases.

Our system achieves the best performance across all of the datasets. According to [36], for example, Fig. 3 illustrates the comparison on the largest set of around 200,000 Internet videos, where the x-axis lists the system of each team, and the y-axis denotes the MAP evaluated by NIST TRECVID. The red bars represent our system with and without PRF. As we see, our system achieves the best performance, which is about three times that of the second best system. The improvement of PRF is evident as it is the only difference between the two red bars in Fig. 3. It is worth noting that the evaluation is very rigid because each system can only submit a single run within 60 minutes after getting the query, and the ground-truth data is not released even after the submission. In addition, for the ad-hoc events (see Fig. 3(b)) the queries are generated online and unknown to the system beforehand. Since blind queries are more challenging, all systems perform worse on ad-hoc events. Nevertheless, our system still achieves an outstanding MAP.

Table 5 Overview of the system performance.

Dataset           #Videos   MAP (×100) 1-split   MAP (×100) 10-splits   time (ms)
MED13Test         25K       20.75                19.47±1.19             80
MED14Test         25K       20.60                17.27±1.82             112
MED14EvalSub-PS   32K       24.1                 -                      192
MED14EvalSub-AH   32K       22.1                 -                      200
MED14Eval-PS      200K      18.1                 -                      1120
MED14Eval-AH      200K      12.2                 -                      880

Fig. 3 The official results released by NIST TRECVID 2014 on MED14Eval (200,000 videos). (a) Pre-Specified (PS); (b) Ad-Hoc (AH). The y-axis is MAP; the bars compare our system (with and without PRF) against the other systems.

Table 6 and Table 7 list the published results on zero-example multimedia event detection. These results are all obtained on NIST's split of MED13Test and MED14Test, and thus are comparable to each other. The last three rows, AutoSQGSys, VisualSys and FullSys, are systems proposed in this paper, where AutoSQGSys uses the automatically generated concept mapping, VisualSys uses only visual features, and FullSys uses all features but no PRF. The methods are ranked according to their MAPs.

Table 6 MAP (×100) comparison with the published results on MED13Test.

Method                     Year   MAP
SIN/DCNN (visual) [18]     2014   2.5
Composite Concepts [10]    2014   6.4
Tag Propagation [28]       2014   9.6
MMPRF [18]                 2014   10.1
Clauses [25]               2014   11.2
Multi-modal Fusion [45]    2014   12.6
SPaR [15]                  2014   12.9
Our AutoSQGSys             2015   12.0
Our VisualSys              2015   18.3
Our FullSys                2015   20.7

Table 7 MAP (×100) comparison with the published results on MED14Test.

Method             Year   MAP
SPCL [17]          2015   9.2
Our AutoSQGSys     2015   11.5
Our VisualSys      2015   17.6
Our FullSys        2015   20.6

5.3 Modality/Feature Contribution

Table 8 compares the modality contributions, where each run represents a system with a certain configuration. The MAP is evaluated on MED14Test and MED13Test, where the ground-truth data is available. As we see, the visual modality is the most contributing modality, which by itself can recover about 85% of the MAP of the full system. ASR and OCR contribute to the full system but prove to be much worse than the visual features. PRF is beneficial in improving the performance of the full system.

Table 8 Comparison of modality contribution.

Run           MED13Test 1-split   MED13Test 10-splits   MED14Test 1-split   MED14Test 10-splits
FullSys+PRF   22.12               -                     22.05               -
FullSys       20.75               19.47±1.19            20.60               18.77±2.16
VisualSys     18.31               18.30±1.11            17.58               17.27±1.82
ASRSys        6.53                6.90±0.74             5.79                4.26±1.19
OCRSys        2.04                4.14±0.07             1.47                2.20±0.73

To understand the feature contribution, we conduct leave-one-feature-out experiments. The performance drop after removing a feature can be used to estimate its contribution to the full system. As we see in Table 9, the results show that every feature provides some contribution. As the feature contribution is mainly dominated by a number of discriminative events, the comparison is more meaningful at the event level (see Table 13, Table 14 and Table 15), where one can tell that, for example, the contribution of ImageNet mainly comes from three events, E031, E015 and E037. Though the MAP drop varies across events and datasets, the average drop on the two datasets follows: Sports > ImageNet > ASR > IACC > YFCC > DIY > OCR. The biggest contributor, Sports, is also the most computationally expensive feature. In fact, the above order of semantic concepts highly correlates with the number of samples in their training datasets, which suggests the rationale of training concepts over big datasets.
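The leave-one-feature-out analysis can be sketched as follows: fuse the per-feature ranked scores with and without each feature and measure the MAP drop. The min-max normalization, the averaging rule, and the toy AP helper are simplifying assumptions.

```python
import numpy as np

def average_precision(scores, relevance):
    """AP of one ranked list given binary relevance labels (toy helper)."""
    order = np.argsort(-scores)
    rel = np.asarray(relevance)[order]
    hits = np.cumsum(rel)
    precisions = hits[rel == 1] / (np.flatnonzero(rel) + 1)
    return precisions.mean() if rel.sum() else 0.0

def fuse(feature_scores, exclude=None):
    """Average late fusion of min-max normalized per-feature scores."""
    used = [np.asarray(s, float) for name, s in feature_scores.items() if name != exclude]
    used = [(s - s.min()) / (np.ptp(s) + 1e-12) for s in used]
    return np.mean(used, axis=0)

# Toy scores for 6 videos from three features, plus ground-truth relevance.
feature_scores = {
    "Sports":   [0.9, 0.1, 0.8, 0.2, 0.1, 0.7],
    "ImageNet": [0.6, 0.2, 0.5, 0.4, 0.1, 0.3],
    "ASR":      [0.1, 0.0, 0.9, 0.0, 0.2, 0.1],
}
relevance = [1, 0, 1, 0, 0, 1]

full_ap = average_precision(fuse(feature_scores), relevance)
for name in feature_scores:
    drop = full_ap - average_precision(fuse(feature_scores, exclude=name), relevance)
    print(f"contribution of {name}: AP drop = {drop:.3f}")
```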

5.4 Quantity (Relevance) & Quality Tradeoff

To study the concept quantity (relevance) and quality trade-off, we conduct the following experiments, where the IACC and ImageNet concepts are gradually added to the query, one at a time, according to their relevance. The relevance judgment is manually annotated by a group of assessors on 20 events. Fig. 4 illustrates the results on a representative event, where the x-axis denotes the number of concepts included, and the y-axis is the average precision. Each curve represents a trial of only selecting concepts of a certain quality. The quality is manually evaluated by the precision of the top 10 detected examples on a third dataset. For example, the red curve indicates selecting only the relevant concepts whose precision of the top 10 detected examples is greater than 0.5. The blue curve marked by circles (p@10 > 0) is a trial of the largest vocabulary but, on average, the least accurate concepts. In contrast, the dashed curve is of the smallest vocabulary but the most accurate concepts, and thus, for a query, there are fewer relevant concepts to choose from. As we see, neither of these settings is optimal. Selecting concepts with a reasonable precision, between 0.2 and 0.5 in this case, seems to be a better choice. The results suggest that incorporating more relevant concepts of reasonable quality is a sensible strategy. We also observed that merely increasing the number of low-quality concepts may not be helpful.

Fig. 4 Adding relevant concepts in "E037 Parking a vehicle" using concepts of different precisions. The x-axis is the number of concepts included in the query, the y-axis is average precision, and each curve corresponds to a precision threshold (p@10 > 0, > 0.2, > 0.5, > 0.8).

5.5 Semantic Matching in SQG

We apply the SQG algorithms to map the user query to the concepts in the vocabulary. We use two metrics to compare these mapping algorithms. One is the precision of the 5 most relevant concepts returned by each algorithm. We manually assess the relevance for 10 events (E031-E040) on 4 concept features (i.e. 200 pairs for each mapping algorithm). The other is the MAP obtained by the 3 most relevant concepts. Table 10 lists the results, where the last column reports the runtime of calculating the mapping between 1,000 pairs of words. The second last row (Fusion) indicates the average fusion of the results of all mapping algorithms. As we see, in terms of P@5, PMI is slightly better than the others, but it is also the slowest because its calculation involves looking up an index of 6 million Wikipedia articles. Fusion of all mapping results yields a better P@5.

We then combine the automatically mapped semantic concepts with the automatically generated ASR and OCR query to test the MAP. Here we assume users have specified which feature to use for each query, and SQG is used to automatically find relevant concepts or words in the specified features. Our result can be understood as an overestimate of a fully automatic SQG system, in which users do not even need to specify the feature. As we see in Table 10, PMI performs the best on MED13Test whereas on MED14Test it is Exact Word Matching. The fusion of all mapping results (the second last row) improves the MAP on both datasets. We then fine-tune the parameters of the mapping fusion and build our AutoSQG system (the last row). As we see, AutoSQG only achieves about 55% of the full system's MAP. Several reasons account for the performance drop: 1) the concept name does not accurately describe what is being detected; 2) the quality of the mapping is limited (P@5 = 0.42); 3) relevant concepts may not necessarily be discriminative. For example, "animal" and "throwing ball" appear to be relevant to the query "playing a fetch", but the former is too general and the latter is about throwing a baseball, which is visually different; "dog" is much less discriminative than "group of dogs" for the query "dog show". The results suggest that automatic SQG is not yet well understood. The proposed automatic mappings are still very preliminary, and could be further refined by manual inspection. We found it is beneficial to represent a concept as a multimodal document that includes a name, description, category, reliability (accuracy) and examples of the top detected video snippets.

Table 10 Comparison of SQG mapping algorithms.

Mapping Method         P@5     MAP 13Test   MAP 14Test   Time (s)
Exact Word Matching    0.340   9.66         7.22         0.10
WordNet                0.330   7.86         6.68         1.22
PMI                    0.355   9.84         6.95         22.20
Word Embedding         0.335   8.79         6.21         0.48
Mapping Fusion         0.420   10.22        9.38         -
AutoSQGSys             -       12.00        11.45        -

5.6 Comparison of Retrieval Methods

Table 11 compares the retrieval models on MED14Test using four features: ASR, OCR and two types of visual concepts. As we see, there is no single retrieval model that works the best for all features. For ASR and OCR words, BM25 and the Language Model with JM smoothing (LM-JM) yield the best MAPs. An interesting observation is that VSM can only achieve 50% of the MAP of LM-JM on ASR (2.94 versus 5.79). This observation suggests that the role of retrieval models in video search is substantial. For semantic concepts, VSM performs no worse than the other models. We hypothesize that this is because the concept representation is dense, i.e. every dimension has a nonzero value, and thus is quite different from sparse text features. To verify this hypothesis, we sparsify the Sports representation using a basic approach, where we only preserve the top m dimensions in a video, and m is set proportional to the concept vocabulary size. As we see, the sparse feature with 1% nonzero elements can recover 85% of the MAP of its dense representation. Besides, BM25 and the language models exhibit better MAPs on the sparse representation. Since the dense representation is difficult to index and search, the results suggest a promising direction for large-scale search using sparse semantic features.

Table 11 Comparison of retrieval models on MED14Test using ASR, OCR, Sports and IACC.

Feat.    Split   VSM-tf   VSM-tfidf   BM25    LM-JM   LM-DL
ASR      1       2.94     1.26        3.43    5.79    1.45
ASR      10      2.67     1.49        3.03    4.26    1.14
OCR      1       0.56     0.47        1.47    1.02    1.22
OCR      10      2.50     2.38        4.52    3.80    4.07
Sports   1       9.21     8.97        8.83    8.75    7.57
Sports   10      10.61    10.58       10.13   10.25   9.04
IACC     1       3.49     3.52        2.44    2.96    2.06
IACC     10      2.88     2.77        2.05    2.45    2.08

Table 12 Study of retrieval performance using sparse concept features (Sports) on MED14Test.

Density   VSM-tf   BM25    LM-JM   LM-DL
1%        9.06     9.58    9.09    9.38
2%        9.93     10.12   10.14   10.07
4%        10.34    10.36   10.26   10.38
16%       10.60    10.45   10.03   9.89
100%      10.61    10.13   10.25   9.04

6 Conclusions and Future Work

We proposed a state-of-the-art semantic video search engine called E-Lamp Lite. The proposed system goes beyond conventional text matching approaches, and allows for semantic search (text-to-video search) without any user-generated metadata or example videos. We shared our lessons on system design and compelling insights from a number of empirical studies. From the experimental results, we arrive at the following recommendations.

– Recommendation 1: Training concept detectors on big datasets is ideal. However, given limited resources, building more detectors of reasonable accuracy seems to be a sensible strategy. Merely increasing the number of low-quality concepts may not improve performance.
– Recommendation 2: PRF (or reranking) is an effective approach to improve the search result.
– Recommendation 3: Retrieval models may have substantial impacts on the search result. A reasonable strategy is to incorporate multiple models and apply them to their appropriate features/modalities.
– Recommendation 4: Automatic query generation for queries in the form of event-kit descriptions is still very challenging. Combining mapping results from various mapping algorithms and applying manual examination afterward is the best strategy known so far.

This paper is merely a first effort towards semantic search in Internet videos. The proposed system can be improved in various ways, e.g. by incorporating more accurate visual and audio concept detectors, by studying more appropriate retrieval models, or by exploring search interfaces and interactive search schemes. As shown in our experiments, automatic semantic query generation is not yet well understood. Closing the gap between the manual and automatic query may point to a promising direction. Another interesting direction is to apply the techniques discussed in this paper to web-scale video data.

Acknowledgements This work was partially supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20068. Deyu Meng was partially supported by the China NSFC project under contract 61373114. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. It used the Blacklight system at the Pittsburgh Supercomputing Center (PSC).

References

1. Banerjee, S., Pedersen, T.: An adapted Lesk algorithm for word sense disambiguation using WordNet. In: CICLing (2002)
2. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: ICML (2009)
3. Bhattacharya, S., Yu, F.X., Chang, S.F.: Minimally needed evidence for complex event recognition in unconstrained videos. In: ICMR (2014)
4. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM TIST 2, 27:1–27:27 (2011)
5. Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Computational Linguistics 16(1), 22–29 (1990)

6. Dalton, J., Allan, J., Mirajkar, P.: Zero-shot video retrieval using content and concepts. In: CIKM (2013)
7. Davidson, J., Liebald, B., Liu, J., et al.: The YouTube video recommendation system. In: RecSys (2010)
8. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)
9. Gkalelis, N., Mezaris, V.: Video event detection using generalized subclass discriminant analysis and linear support vector machines. In: ICMR (2014)
10. Habibian, A., Mensink, T., Snoek, C.G.: Composite concept discovery for zero-shot video event detection. In: ICMR (2014)
11. Habibian, A., van de Sande, K.E., Snoek, C.G.: Recommendations for video event recognition using concept vocabularies. In: ICMR (2013)
12. Hauptmann, A., Yan, R., Lin, W.H., Christel, M., Wactlar, H.: Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. TMM 9(5), 958–966 (2007)
13. Inoue, N., Shinoda, K.: n-gram models for video semantic indexing. In: MM (2014)
14. Jiang, L., Hauptmann, A., Xiang, G.: Leveraging high-level and low-level features for multimedia event detection. In: MM (2012)
15. Jiang, L., Meng, D., Mitamura, T., Hauptmann, A.G.: Easy samples first: Self-paced reranking for zero-example multimedia search. In: MM (2014)
16. Jiang, L., Meng, D., Yu, S.I., Lan, Z., Shan, S., Hauptmann, A.G.: Self-paced learning with diversity. In: NIPS (2014)
17. Jiang, L., Meng, D., Zhao, Q., Shan, S., Hauptmann, A.G.: Self-paced curriculum learning. In: AAAI (2015)
18. Jiang, L., Mitamura, T., Yu, S.I., Hauptmann, A.G.: Zero-example event search using multimodal pseudo relevance feedback. In: ICMR (2014)
19. Jiang, L., Tong, W., Meng, D., Hauptmann, A.G.: Towards efficient learning of optimal spatial bag-of-words representations. In: ICMR (2014)
20. Jiang, L., Yu, S.I., Meng, D., Mitamura, T., Hauptmann, A.G.: Bridging the ultimate semantic gap: A semantic search engine for internet videos. In: ICMR (2015)
21. Jiang, L., Yu, S.I., Meng, D., Yang, Y., Mitamura, T., Hauptmann, A.G.: Fast and accurate content-based semantic search in 100m internet videos. In: MM (2015)
22. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)
23. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
24. Kumar, M., Packer, B., Koller, D.: Self-paced learning for latent variable models. In: NIPS (2010)
25. Lee, H.: Analyzing complex events and human actions in "in-the-wild" videos. In: UMD Ph.D Theses and Dissertations (2014)
26. Levy, O., Goldberg, Y.: Dependency-based word embeddings. In: ACL (2014)
27. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval, vol. 1. Cambridge University Press, Cambridge (2008)
28. Mazloom, M., Li, X., Snoek, C.G.: Few-example video event retrieval using tag propagation. In: ICMR (2014)
29. Miao, Y., Gowayyed, M., Metze, F.: EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. arXiv preprint arXiv:1507.08240 (2015)
30. Miao, Y., Jiang, L., Zhang, H., Metze, F.: Improvements to speaker adaptive training of deep neural networks. In: SLT (2014)
31. Miao, Y., Metze, F.: Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training. In: INTERSPEECH (2013)
32. Miao, Y., Metze, F., Rawat, S.: Deep maxout networks for low-resource speech recognition. In: ASRU (2013)
33. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)
34. Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., Corrado, G.S., Dean, J.: Zero-shot learning by convex combination of semantic embeddings. ICLR (2014)
35. Oh, S., McCloskey, S., Kim, I., Vahdat, A., Cannons, K.J., Hajimirsadeghi, H., Mori, G., Perera, A.A., Pandey, M., Corso, J.J.: Multimedia event detection with multimodal feature fusion and temporal concept localization. Machine Vision and Applications 25(1), 49–69 (2014)
36. Over, P., Awad, G., Michel, M., Fiscus, J., Sanders, G., Kraaij, W., Smeaton, A.F., Quenot, G.: TRECVID 2014 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In: TRECVID (2014)
37. Palatucci, M., Pomerleau, D., Hinton, G.E., Mitchell, T.M.: Zero-shot learning with semantic output codes. In: NIPS (2009)
38. Povey, D., Ghoshal, A., Boulianne, G., et al.: The Kaldi speech recognition toolkit. In: ASRU (2011)
39. Safadi, B., Sahuguet, M., Huet, B.: When textual and visual information join forces for multimedia retrieval. In: ICMR (2014)
40. Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817 (2015)
41. Tong, W., Yang, Y., Jiang, L., et al.: E-Lamp: integration of innovative ideas for multimedia event detection. Machine Vision and Applications 25(1), 5–15 (2014)
42. Vedaldi, A., Zisserman, A.: Efficient additive kernels via explicit feature maps. PAMI 34(3), 480–492 (2012)
43. Wang, F., Sun, Z., Jiang, Y., Ngo, C.: Video event detection using motion relativity and feature selection. In: TMM (2013)
44. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV (2013)
45. Wu, S., Bondugula, S., Luisier, F., Zhuang, X., Natarajan, P.: Zero-shot event detection using multi-modal fusion of weakly supervised concepts. In: CVPR (2014)
46. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: ACL (1994)
47. Younessian, E., Mitamura, T., Hauptmann, A.: Multimodal knowledge-based analysis in multimedia event detection. In: ICMR (2012)
48. Yu, S.I., Jiang, L., Hauptmann, A.: Instructional videos for unsupervised harvesting and learning of action examples. In: MM (2014)
49. Yu, S.I., Jiang, L., Xu, Z., Yang, Y., Hauptmann, A.G.: Content-based video search over 1 million videos with 1 core in 1 second. In: ICMR (2015)
50. Yu, S.I., Jiang, L., Xu, Z., et al.: CMU-Informedia@TRECVID 2014. In: TRECVID (2014)
51. Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to information retrieval. TOIS 22(2) (2004)
52. Zhao, Q., Meng, D., Jiang, L., Xie, Q., Xu, Z., Hauptmann, A.G.: Self-paced learning for matrix factorization. In: AAAI (2015)
Table 1 User query (event-kit description) for the event "Making a sandwich".

Event name: Making a sandwich

Definition: Constructing an edible food item from ingredients, often including one or more slices of bread plus fillings

Explication: Sandwiches are generally made by placing food items on top of a piece of bread, roll or similar item, and placing another piece of bread on top of the food items. Sandwiches with only one slice of bread are less common and are called "open face sandwiches". The food items inserted within the slices of bread are known as "fillings" and often include sliced meat, vegetables (commonly used vegetables include lettuce, tomatoes, onions, bell peppers, bean sprouts, cucumbers, and olives), and sliced or grated cheese. Often, a liquid or semi-liquid "condiment" or "spread" such as oil, mayonnaise, mustard, and/or flavored sauce, is drizzled onto the sandwich or spread with a knife on the bread or top of the sandwich fillers. The sandwich or bread used in the sandwich may also be heated in some way by placing it in a toaster, oven, frying pan, countertop grilling machine, microwave or grill. Sandwiches are a popular meal to make at home and are available for purchase in many cafes, convenience stores, and as part of the lunch menu at many restaurants.

Evidences:
  scene: indoors (kitchen or restaurant or cafeteria) or outdoors (a park or backyard)
  objects/people: bread of various types; fillings (meat, cheese, vegetables), condiments, knives, plates, other utensils
  activities: slicing, toasting bread, spreading condiments on bread, placing fillings on bread, cutting or dishing up fillings
  audio: noises from equipment hitting the work surface; narration of or commentary on the process; noises emanating from equipment (e.g. microwave or griddle)
Table 2 System query for the event "E011 Making a sandwich".

Modality | ID | Name | Category | Relevance
Visual | sin346_133 | food | man made thing, food | very relevant
Visual | sin346_183 | kitchen | structure building, room | very relevant
Visual | yfcc609_505 | cooking | human activity, working utensil tool | very relevant
Visual | sin346_261 | room | structure building, room | relevant
Visual | sin346_28 | bar pub | structure building, commercial building | relevant
Visual | yfcc609_145 | lunch | food, meal | relevant
Visual | yfcc609_92 | dinner | food, meal | relevant
ASR | ASR long | sandwich, food, bread, fill, place, meat, vegetable, cheese, condiment, knife, plate, utensil, slice, toast, spread, cut, dish | - | relevant
OCR | OCR short | sandwich | - | relevant
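Table 2 shows how the free-text event kit of Table 1 is mapped to a structured system query over automatically detected concepts, ASR terms, and OCR terms. As a purely illustrative sketch (the numeric weights, the underscore-joined concept IDs, and the simple weighted-sum scoring below are assumptions for exposition, not the system's exact retrieval model), such a query can be represented as concept/relevance pairs, and a video can be ranked by a weighted combination of its concept detection scores:

# Minimal sketch (not the paper's exact retrieval model): a system query is a
# list of (concept_id, relevance) pairs, and a video is ranked by the weighted
# sum of its concept detection scores.

# Hypothetical mapping from relevance labels to weights.
RELEVANCE_WEIGHT = {"very relevant": 1.0, "relevant": 0.5}

# Visual part of the system query for "E011 Making a sandwich" (cf. Table 2).
system_query = [
    ("sin346_133", "very relevant"),   # food
    ("sin346_183", "very relevant"),   # kitchen
    ("yfcc609_505", "very relevant"),  # cooking
    ("sin346_261", "relevant"),        # room
    ("sin346_28", "relevant"),         # bar pub
    ("yfcc609_145", "relevant"),       # lunch
    ("yfcc609_92", "relevant"),        # dinner
]

def score_video(concept_scores, query):
    """concept_scores: dict mapping concept_id -> detection score in [0, 1]."""
    return sum(RELEVANCE_WEIGHT[rel] * concept_scores.get(cid, 0.0)
               for cid, rel in query)

# Toy example: a video with strong "food" and "kitchen" detections.
video = {"sin346_133": 0.9, "sin346_183": 0.8, "yfcc609_145": 0.3}
print(round(score_video(video, system_query), 3))  # 1.85

In the actual system, the scores from the individual modalities (visual concepts, ASR, OCR) are further combined, which is what the FullSys column in Table 13 reflects.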
Table 4 Summary of the datasets for training semantic visual concepts. ImageNet features are trained on still images, and
the rest are trained on videos.
Dataset #Samples #Classes Category Example Concepts
DIY [48, 50] 72,000 1,601 Instructional videos Yoga, Juggling, Cooking
IACC [36] 600,000 346 Internet archive videos Baby, Outdoor, Sitting down
YFCC [40] 800,000 609 Amateur videos on Flickr Beach, Snow, Dancing
ImageNet [8] 1,000,000 1,000 Still images Bee, Corkscrew, Cloak
Sports [22] 1,100,000 487 Sports videos on YouTube Bullfighting, Cycling, Skiing
Table 9 Comparison of feature contribution. Each row corresponds to the full system with the single feature named in its SysID removed; X marks the features that remain enabled. MAP Drop(%) is the relative decrease with respect to the full-system MAP on the same test set.

SysID  IACC  Sports  YFCC  DIY  ImageNet  ASR  OCR  MAP 1-split  MAP 10-splits  MAP Drop(%)
MED13/IACC      -  X  X  X  X  X  X  18.93  18.61±1.13  9%
MED13/Sports    X  -  X  X  X  X  X  15.67  14.68±0.92  25%
MED13/YFCC      X  X  -  X  X  X  X  18.14  18.47±1.21  13%
MED13/DIY       X  X  X  -  X  X  X  19.95  18.70±1.19  4%
MED13/ImageNet  X  X  X  X  -  X  X  18.18  16.58±1.18  12%
MED13/ASR       X  X  X  X  X  -  X  18.48  18.78±1.10  11%
MED13/OCR       X  X  X  X  X  X  -  20.59  19.12±1.20  1%
MED14/IACC      -  X  X  X  X  X  X  18.34  17.79±1.95  11%
MED14/Sports    X  -  X  X  X  X  X  13.93  12.47±1.93  32%
MED14/YFCC      X  X  -  X  X  X  X  20.05  18.55±2.13  3%
MED14/DIY       X  X  X  -  X  X  X  20.40  18.42±2.22  1%
MED14/ImageNet  X  X  X  X  -  X  X  16.37  15.21±1.91  20%
MED14/ASR       X  X  X  X  X  -  X  18.36  17.62±1.84  11%
MED14/OCR       X  X  X  X  X  X  -  20.43  18.86±2.20  1%
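The MAP Drop(%) column in Table 9 is the relative decrease of the 1-split MAP with respect to the full system on the corresponding test set (0.2075 on MED13Test and 0.2060 on MED14Test; see Table 13). The following two-line snippet is only an illustration of this arithmetic:

def map_drop(full_map, ablated_map):
    """Relative MAP drop (in %) when one feature is removed from the full system."""
    return 100.0 * (full_map - ablated_map) / full_map

# MED13Test: full-system MAP = 20.75 (Table 13), system without IACC = 18.93.
print(round(map_drop(20.75, 18.93)))  # 9  (Table 9 reports 9%)

# MED14Test: full-system MAP = 20.60, system without Sports concepts = 13.93.
print(round(map_drop(20.60, 13.93)))  # 32 (Table 9 reports 32%)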
Table 13 Event-level comparison of modality contribution on the NIST split. The best AP is marked in bold.
Event ID & Name FullSys FullSys+PRF VisualSys ASRSys OCRSys
E006: Birthday party 0.3842 0.3862 0.3673 0.0327 0.0386
E007: Changing a vehicle tire 0.2322 0.3240 0.2162 0.1707 0.0212
E008: Flash mob gathering 0.2864 0.4310 0.2864 0.0052 0.0409
E009: Getting a vehicle unstuck 0.1588 0.1561 0.1588 0.0063 0.0162
E010: Grooming an animal 0.0782 0.0725 0.0782 0.0166 0.0050
E011: Making a sandwich 0.1183 0.1304 0.1064 0.2184 0.0682
E012: Parade 0.5566 0.5319 0.5566 0.0080 0.0645
E013: Parkour 0.0545 0.0839 0.0448 0.0043 0.0066
E014: Repairing an appliance 0.2619 0.2989 0.2341 0.2086 0.0258
E015: Working on a sewing project 0.2068 0.2021 0.2036 0.0866 0.0166
E021: Attempting a bike trick 0.0635 0.0701 0.0635 0.0006 0.0046
E022: Cleaning an appliance 0.2634 0.1747 0.0008 0.2634 0.0105
E023: Dog show 0.6737 0.6610 0.6737 0.0009 0.0303
E024: Giving directions to a location 0.0614 0.0228 0.0011 0.0614 0.0036
E025: Marriage proposal 0.0188 0.0270 0.0024 0.0021 0.0188
E026: Renovating a home 0.0252 0.0160 0.0252 0.0026 0.0023
E027: Rock climbing 0.2077 0.2001 0.2077 0.1127 0.0038
E028: Town hall meeting 0.2492 0.3172 0.2492 0.0064 0.0134
E029: Winning a race without a vehicle 0.1257 0.1929 0.1257 0.0011 0.0019
E030: Working on a metal crafts project 0.1238 0.1255 0.0608 0.0981 0.0142
E031: Beekeeping 0.5883 0.6401 0.5883 0.2676 0.0440
E032: Wedding shower 0.0833 0.0879 0.0459 0.0428 0.0017
E033: Non-motorized vehicle repair 0.5198 0.5263 0.5198 0.0828 0.0159
E034: Fixing musical instrument 0.0276 0.0444 0.0170 0.0248 0.0023
E035: Horse riding competition 0.3677 0.3710 0.3677 0.0013 0.0104
E036: Felling a tree 0.0968 0.1180 0.0968 0.0020 0.0076
E037: Parking a vehicle 0.2918 0.2477 0.2918 0.0008 0.0009
E038: Playing fetch 0.0339 0.0373 0.0339 0.0020 0.0014
E039: Tailgating 0.1437 0.1501 0.1437 0.0013 0.0388
E040: Tuning musical instrument 0.1554 0.3804 0.0010 0.1840 0.0677
MAP (MED13Test E006-E015 E021-E030) 0.2075 0.2212 0.1831 0.0653 0.0203
MAP (MED14Test E021-E040) 0.2060 0.2205 0.1758 0.0579 0.0147
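The per-event values in Tables 13-15 are average precision (AP) scores, and the MAP rows are their means over the listed events. For reference, the sketch below implements the standard non-interpolated AP and MAP; the official NIST evaluation protocol may differ in minor details such as tie handling and judgment depth.

def average_precision(ranked_relevance, num_relevant=None):
    """Non-interpolated AP for a ranked list of 0/1 relevance labels.

    num_relevant: total number of relevant items for the query; defaults to
    the number of relevant items appearing in the ranked list.
    """
    if num_relevant is None:
        num_relevant = sum(ranked_relevance)
    if num_relevant == 0:
        return 0.0
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / num_relevant

def mean_average_precision(per_event_rankings):
    """MAP over a list of per-event ranked relevance lists."""
    return sum(average_precision(r) for r in per_event_rankings) / len(per_event_rankings)

# Toy example: two events with short, fully judged ranked lists.
print(round(average_precision([1, 0, 1, 0, 0]), 4))           # 0.8333
print(round(mean_average_precision([[1, 0, 1], [0, 1]]), 4))  # 0.6667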
Table 14 Event-level comparison of visual feature contribution on the NIST split.
Event ID & Name FullSys MED/IACC MED/Sports MED/YFCC MED/DIY MED/ImageNet
E006: Birthday party 0.3842 0.3797 0.3842 0.2814 0.3842 0.2876
E007: Changing a vehicle tire 0.2322 0.2720 0.2782 0.1811 0.1247 0.0998
E008: Flash mob gathering 0.2864 0.1872 0.2864 0.3345 0.2864 0.2864
E009: Getting a vehicle unstuck 0.1588 0.1070 0.1588 0.1132 0.1588 0.1588
E010: Grooming an animal 0.0782 0.0902 0.0782 0.0914 0.0474 0.0782
E011: Making a sandwich 0.1183 0.0926 0.1183 0.1146 0.1183 0.1183
E012: Parade 0.5566 0.5738 0.5566 0.3007 0.5566 0.5566
E013: Parkour 0.0545 0.0066 0.0545 0.0545 0.0545 0.0545
E014: Repairing an appliance 0.2619 0.2247 0.2619 0.1709 0.2619 0.1129
E015: Working on a sewing project 0.2068 0.2166 0.2068 0.2068 0.1847 0.0712
E021: Attempting a bike trick 0.0635 0.0635 0.0006 0.0635 0.0635 0.0635
E022: Cleaning an appliance 0.2634 0.2634 0.2634 0.2634 0.2634 0.2634
E023: Dog show 0.6737 0.6737 0.0007 0.6737 0.6737 0.6737
E024: Giving directions to a location 0.0614 0.0614 0.0614 0.0614 0.0614 0.0614
E025: Marriage proposal 0.0188 0.0188 0.0188 0.0188 0.0188 0.0188
E026: Renovating a home 0.0252 0.0017 0.0252 0.0252 0.0252 0.0252
E027: Rock climbing 0.2077 0.2077 0.0009 0.2077 0.2077 0.2077
E028: Town hall meeting 0.2492 0.0956 0.2492 0.2418 0.2492 0.2492
E029: Winning a race without a vehicle 0.1257 0.1257 0.0056 0.1257 0.1257 0.1257
E030: Working on a metal crafts project 0.1238 0.1238 0.1238 0.0981 0.1238 0.1238
E031: Beekeeping 0.5883 0.5883 0.5883 0.5883 0.5883 0.0012
E032: Wedding shower 0.0833 0.0833 0.0833 0.0833 0.0924 0.0833
E033: Non-motorized vehicle repair 0.5198 0.5198 0.4440 0.5198 0.4742 0.4417
E034: Fixing musical instrument 0.0276 0.0276 0.0276 0.0276 0.0439 0.0276
E035: Horse riding competition 0.3677 0.3430 0.1916 0.3677 0.3677 0.3677
E036: Felling a tree 0.0968 0.0275 0.1100 0.0968 0.0968 0.0968
E037: Parking a vehicle 0.2918 0.1902 0.2918 0.2918 0.2918 0.1097
E038: Playing fetch 0.0339 0.0339 0.0008 0.0339 0.0339 0.0339
E039: Tailgating 0.1437 0.0631 0.1437 0.0666 0.1437 0.1437
E040: Tuning musical instrument 0.1554 0.1554 0.1554 0.1554 0.1554 0.1554
MAP (MED13Test E006-E015 E021-E030) 0.2075 0.1893 0.1567 0.1814 0.1995 0.1818
MAP (MED14Test E021-E040) 0.2060 0.1834 0.1393 0.2005 0.2050 0.1637
Table 15 Event-level comparison of textual feature contribution on the NIST split.
Event ID & Name FullSys MED/ASR MED/OCR
E006: Birthday party 0.3842 0.3842 0.3673
E007: Changing a vehicle tire 0.2322 0.2162 0.2322
E008: Flash mob gathering 0.2864 0.2864 0.2864
E009: Getting a vehicle unstuck 0.1588 0.1588 0.1588
E010: Grooming an animal 0.0782 0.0782 0.0782
E011: Making a sandwich 0.1183 0.1043 0.1205
E012: Parade 0.5566 0.5566 0.5566
E013: Parkour 0.0545 0.0545 0.0448
E014: Repairing an appliance 0.2619 0.2436 0.2527
E015: Working on a sewing project 0.2068 0.1872 0.2242
E021: Attempting a bike trick 0.0635 0.0635 0.0635
E022: Cleaning an appliance 0.2634 0.0008 0.2634
E023: Dog show 0.6737 0.6737 0.6737
E024: Giving directions to a location 0.0614 0.0011 0.0614
E025: Marriage proposal 0.0188 0.0188 0.0024
E026: Renovating a home 0.0252 0.0252 0.0252
E027: Rock climbing 0.2077 0.2077 0.2077
E028: Town hall meeting 0.2492 0.2492 0.2492
E029: Winning a race without a vehicle 0.1257 0.1257 0.1257
E030: Working on a metal crafts project 0.1238 0.0608 0.1238
E031: Beekeeping 0.5883 0.5883 0.5883
E032: Wedding shower 0.0833 0.0833 0.0459
E033: Non-motorized vehicle repair 0.5198 0.5198 0.5198
E034: Fixing musical instrument 0.0276 0.0314 0.0178
E035: Horse riding competition 0.3677 0.3677 0.3677
E036: Felling a tree 0.0968 0.0968 0.0968
E037: Parking a vehicle 0.2918 0.2918 0.2918
E038: Playing fetch 0.0339 0.0339 0.0339
E039: Tailgating 0.1437 0.1437 0.1437
E040: Tuning musical instrument 0.1554 0.0893 0.1840
MAP (MED13Test E006-E015 E021-E030) 0.2075 0.1848 0.2059
MAP (MED14Test E021-E040) 0.2060 0.1836 0.2043