
Information Processing and Management 56 (2019) 102091

Contents lists available at ScienceDirect

Information Processing and Management


journal homepage: www.elsevier.com/locate/infoproman

The usefulness of multimedia surrogates for making relevance judgments about digital video objects

Barbara M. Wildemuth a,⁎, Gary Marchionini a, Xin Fu b, Jun Sung Oh a, Meng Yang c

a University of North Carolina at Chapel Hill, United States
b Data Science, Facebook, 1 Hacker Way, Menlo Park, CA 94025, United States
c User Experience and Customer Insights, NetBrain, 15 Network Drive, Burlington, MA 01803, United States

ARTICLE INFO

Keywords: Digital video collections; Surrogates; Information searching; Relevance

ABSTRACT

Large collections of digital video are increasingly accessible. The large volume and range of available video demands search tools that allow people to browse and query easily and to quickly make sense of the videos behind the result sets. This study focused on the usefulness of several multimedia surrogates, in terms of effectiveness, efficiency, and user satisfaction. Three surrogates were evaluated and compared: a storyboard, a 7-second segment, and a fast forward. Thirty-six experienced users of digital video conducted searches on each of four systems: three incorporated one of the surrogates each, and the fourth made all three surrogates available. Participants judged the relevance of at least 10 items for each search based on the surrogate(s) available, then re-judged the relevance of two of those items based on viewing the full video. Transaction logs and post-search and post-session questionnaires provided data on user interactions, including relevance judgments, and user perceptions. All of the surrogates provided a basis for accurate relevance judgments, though they varied (in expected ways) in terms of their efficiency. User perceptions favored the system with all three surrogates available, even though it took longer to use; they found it easier to learn and easier to use, and it gave them more confidence in their judgments. Based on these results, we conclude that it is important for digital video collections to provide multiple surrogates, each providing a different view of the video.

1. Introduction

Viewing videos online is part of the everyday life of millions of people. It is projected that, by the end of 2019, there will be 232.1
million online video viewers in the United States (eMarketer, 2018). These U.S. users spend an average of almost 70 min per week
watching online video, and those in the 18–34 age range average about 94 min per week (Q1 2018; Nielsen, 2018). While much of
this viewing is done by YouTube users, watching a billion hours of video each day,1 viewers also use other platforms such as Facebook
and Snapchat (Bloomberg, 2016; TechCrunch, 2009). While many viewers are watching digital video to be entertained or inspired,
there are also many that have more content-specific motivations, such as learning something new or staying up to date on a particular
topic (Google, 2016). As noted by Follett (2015), “a lot of [viewers] are watching the latest viral video with a goofy cat or a cute kid,
but an awful lot of them are looking for advice on how to do something or how to make something work better.”
In addition to the popular video streaming sites, large collections of digital video are increasingly accessible and include diverse


⁎ Corresponding author.
E-mail address: wildemuth@unc.edu (B.M. Wildemuth).
1 https://www.youtube.com/yt/about/press/.

https://doi.org/10.1016/j.ipm.2019.102091
Received 19 November 2018; Received in revised form 6 June 2019; Accepted 23 July 2019
0306-4573/ © 2019 Elsevier Ltd. All rights reserved.

genres and lengths (see, for example, the listings of video collections maintained by the University of Minnesota Libraries2 and the
Arizona State University Library3). Just a few examples illustrate the range of such collections and their possible uses. The American
Archive of Public Broadcasting includes “thousands of high quality programs that have had national impact” as well as “regional and
local programs that document American communities during the last half of the twentieth century and the first decade of the twenty-
first.”4 Folkstreams is a non-profit organization “dedicated to finding, preserving, contextualizing, and showcasing documentary films
on American traditional cultures.”5 The films in its collection “are often produced by independent filmmakers and focus on the
culture, struggles, and arts of unnoticed Americans from many different regions and communities.” Many of these films are directly
linked to significant published research, illustrating the way in which online videos can contribute to our cultural knowledge.
ScienceCinema6 “highlights scientific videos featuring leading-edge research from the U.S. Department of Energy.” It exemplifies one
means for dissemination of government-generated scientific findings. These few examples serve only to illustrate the many collections
currently available (see the listings compiled by the University of Minnesota Libraries and the Arizona State University Library for
additional examples).
This large volume and range of available video demands search tools that allow people to browse and query easily
(Albertson, 2013) and to quickly make sense of the videos behind the result sets. Surrogates, such as the textual ‘snippets’ and poster
frames provided in the results lists of most search engines, are essential components of user interfaces for all search systems but are
even more crucial for video collections since the full meaning of a video is determined from features in multiple sensory channels.
Linguistic meaning is typically carried in the audio channel, although superimpositions of text can augment this meaning in the visual
channel. Visual meaning, such as action and graphical style, is mainly carried in the visual channel, and affective meaning, such as
mood, is often carried in both channels through music, verbal tone, facial expressions, and visual effects. Effective video surrogates
should capture many of these types of meanings – a goal not fully achievable with textual surrogates such as keywords or written
descriptions alone. As reported by video users, one of the current barriers to using online video collections is the lack of high quality
surrogates (Albertson, 2016).
The multimodal aspects of video suggest the need for either more detail in surrogates or layers of surrogates at different gran-
ularities. The bulk of surrogates in today's video retrieval systems include only a poster frame (i.e., a single keyframe), a title, and a
text snippet.7 While many online videos are short (e.g., the videos viewed most frequently in YouTube are 3–4 min long; Rossetto &
Schuldt, 2018), which invites immediate playing rather than examining surrogates, we believe that the volume of video, whether
short or long, increases the need for surrogates that allow people to quickly filter what is available.
Although there is an enormous amount of technical research directed at creating such surrogates, there is little work on un-
derstanding how people can and will actually use them (Lokoč, Bailer, Schoeffmann, Muenzer, & Awad, 2018; Schoeffmann, Hudelist,
& Huber, 2015). In addition, while improvements in hardware and networking infrastructure have greatly decreased the burden of downloading relatively long videos, the human time required to examine, select, and manipulate those videos has not decreased correspondingly. To reduce that time, we need surrogates that people can evaluate quickly and that are faithful enough to support accurate assessments of the relevance of the full video. Therefore, this study examined and compared people's selection
and use of several different digital video surrogates.

2. A framework for surrogation

Surrogates “stand for” objects. Abstracts, titles, and keywords are all familiar surrogates for complete documents, broadly defined
(Buckland, 1997), and have long been central to library and information science research (e.g., Borko & Bernier, 1975). Surrogates
continue to play an important role in the World Wide Web, appearing as “snippets” on search results pages. Taking a simplistic view,
the purpose of a surrogate is to instantiate the concept of information compaction: the ratio of the time needed to view the original
information object to the time needed to view the surrogate (Wildemuth et al., 2002). Surrogates provide an important alternative to
primary objects, as they take far less time to examine while simultaneously providing enough semantic cues to extract gist and
allowing users to assess the need for further examination of other surrogates and the primary object.
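To make the compaction concept concrete, the ratio can be computed directly from viewing times. The durations below are purely hypothetical illustrations, not measurements from this study:

```python
def compaction_ratio(object_seconds, surrogate_seconds):
    """Time needed to view the original object divided by the
    time needed to view its surrogate (Wildemuth et al., 2002)."""
    return object_seconds / surrogate_seconds

# Hypothetical viewing times for a 10-minute (600-second) video.
video = 600
surrogate_times = {
    "storyboard": 15,        # scanning a grid of keyframes
    "7-second segment": 7,
    "fast forward": 24,      # 600 s played back at 25x speed
}

for name, seconds in surrogate_times.items():
    print(f"{name}: {compaction_ratio(video, seconds):.0f}:1")
```

Under these assumed durations, all three surrogates compact viewing time by well over an order of magnitude, which is what makes rapid filtering of result sets feasible.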
A general framework for understanding the key criteria for designing surrogates for videos would be human centered and would
include multiple media considerations in both the primary information object and the surrogates. Some key points in such a fra-
mework are outlined here, by defining surrogates as representations, describing their different uses, considering how people interact
with information objects and their surrogates, discussing several design factors, and examining current practice in surrogate im-
plementation.

2 https://libguides.umn.edu/c.php?g=845556.
3 https://libguides.asu.edu/StreamingVideo/Internet_Sites.
4 http://americanarchive.org/about-the-american-archive.
5 http://www.folkstreams.net/about.php.
6 https://www.osti.gov/sciencecinema/.
7 A dozen online video collections (including YouTube) were analyzed in terms of the surrogates provided. These included the American Archive of Public Broadcasting (cited above), Culture Unplugged (https://www.cultureunplugged.com/filmedia/), Folkstreams (cited above), Media Burn (https://mediaburn.org/), the Moving Image Archive of the Internet Archive (https://archive.org/details/movies), the National Film Board of Canada (https://www.nfb.ca/films/), the US National Park Service B-roll films (https://www.nps.gov/media/multimedia-search.htm), NOVA (https://www.pbs.org/wgbh/nova/programs/), PBS Video (https://www.pbs.org/video/), ScienceCinema (cited above), the WGBH Open Vault (http://openvault.wgbh.org/), and YouTube. None of these included storyboards, skims, or fast forward surrogates.


2.1. Surrogates as representations for representations

All aspects of recorded human knowledge are representations for the natural world or the internal workings of human minds.
These representations are necessarily abstractions of the reality they represent, as seen in reductions in both detail and di-
mensionality. For example, a photograph captures a portion of a scene in some resolution of detail but does not capture the smell of
the scene; it is a compression of detail and a reduction in the richness of the scene onto a single visual dimension. Multimedia
representations are powerful because they suffer less overall dimensionality reduction, although they may compress most of the detail
in those dimensions and, of course, lose other dimensions. Over our lives, we use these representations as if they were primary
information objects, to make sense of the world and to guide action, so much so that habituation often blurs the distinction between
the base reality and the representation. Because there are so many information objects available, people create retrieval systems that
support finding the most relevant objects on demand. With such systems, surrogates are secondary representations for the recorded
representations of reality – the primary objects being stored in the system – serving both as cues for retrieval and as cues for
understanding the primary information object. As O'Connor (1985) noted, “A surrogate may be seen as something which shares
sufficient attributes with that for which it stands, so that in those situations for which it is intended to be a surrogate, it would be
classified the same as its referent” (p.211).

2.2. Use of surrogates for finding, filtering, and sense making

As information retrieval systems were just beginning to be designed, researchers realized that surrogates could (and should) serve
at least two purposes: “1) to determine whether a document is relevant for a specific purpose, and 2) to find out some information
from the document without having to read it in its original form” (Rath, Resnick, & Savage, 1961, p.126). This view persists, as
Marchionini and White (2007) argue that a good surrogate will have a primary use and many secondary uses, ranging from “finding”
to “understanding.” Some surrogates may be optimized for retrieval (e.g., a specific URL or ISBN), while others may be designed to
convey substantial meaning (e.g., an abstract or critical review). Although surrogates were traditionally created to support retrieval,
they are increasingly important as an aid in filtering the objects retrieved and interpreting the collective results to meet the searcher's
problem needs. Surrogates advance the filtering process by helping people attain enough understanding about the primary object to
decide whether to invest effort in careful examination (Albertson & Johnston, 2017), and they advance problem solving by helping
people to understand the specific primary objects as well as the relationships among primary objects. We adopt the term ‘sense
making’ to describe these processes of understanding and problem solving, since our meaning is close to that associated with Dervin's
Sense Making Theory (Dervin & Frenette, 2001).
Because easy access to digital video is a relatively recent development,8 theories of how people make sense from video must look
to theories from general cognition and from still image and film understanding. One important theory of cognition in this respect is
Paivio's (1986) dual encoding framework, which postulates that people are able to easily move between verbal and non-verbal stimuli
and in fact create deeper meaning by having both kinds of encoding for a concept or process. Thus, well-designed multimodal
surrogates should be helpful to people as they search for and make sense of video content.
Focusing on the content itself to explain how people make sense of images and film, Panofsky (1955) defined three levels of
meaning for art images. The pre-iconographical level addresses the physical object represented (e.g., a dog), the iconographic level
represents the specific subject matter (e.g., the dog named Lassie), and the iconological level represents the symbolic, highly personal
meaning (e.g., the need for a trusted friend who gives unconditional love). This theory has been used by a variety of image retrieval
researchers (e.g., Enser, 2000), and Jorgensen (2003) synthesized the literature to present a multi-layered model of image attributes.
Four of her layers include syntactic attributes, while six semantic layers are concerned with concepts. Yang and Marchionini (2005)
have built on this work with still images to conduct empirical studies of experts seeking video assets, but there is much more to learn
because video adds a distinct temporal factor that must also be taken into account.
Contemporary film theorists typically argue that viewers actively construct their own meaning from film and focus on a narrative
emerging over time as the film is viewed. This active construction of meaning parallels general cognitive theories that posit active
construction of mental models to understand the world. Video retrieval researchers have understandably focused on the features of
video that might be leveraged to support retrieval and understanding. The CMU Informedia Project (Christel, 2008; Christel et al.,
1998) and the IBM MARVEL system (Naphade & Smith, 2003; Natsev, Tešić, Xie, Yan, & Smith, 2007) both leverage a multiplicity of
features to automatically index video. Dimitrova (2003) provides a model for video research that builds from the myriad of low-level
features amenable to signal processing (e.g., pixel bit patterns, audio frequencies) through to high-level semantic features (e.g., video
skims, video highlights). These efforts represent attempts to build semantically meaningful video surrogates that will influence
people's effort and outcomes.

2.3. Interactions with surrogates

Information foraging theory (Pirolli & Card, 1999) and the principle of least effort (Zipf, 1949) both suggest that, as people
interact with information, they take into account the effort required, as well as the outcomes achieved. Engagement with digital
videos and their surrogates requires that people balance the economic, physical, perceptual, and cognitive loads they experience,

8 YouTube was created in 2005; Netflix began offering its streaming service in 2007 (see the relevant Wikipedia entries for details).


across modalities. Kalyuga, Chandler, and Sweller (1999), Mayer and Moreno (1998), and Tindall-Ford, Chandler, and Sweller (1997)
have all demonstrated how multimedia designs that balance modality loads can lead to superior learning effects. In addition, these
tradeoffs are complicated by the finding that people's interaction preferences may not coincide with their performance outcomes. For
example, Frøkjær, Hertzum, and Hornbæk (2000) demonstrated that effectiveness, efficiency, and satisfaction do not tend to be
correlated. Ding, Marchionini, and Tse (1997) found that people interacting with digital videos preferred having storyboard displays
over slide shows, even though slide shows supported comparable performance more quickly. Other studies have shown the efficacy of
giving people more control in retrieval (e.g., Koenemann & Belkin, 1996) and the ways that user perception of control and perceived
task difficulty support flow in computer-supported cooperative work (e.g., Ghani, Supnick, & Rooney, 1991). Clearly, there are interactions among level of user control, affective response, mental load, and performance that should be taken into account when designing surrogates for digital videos.

2.4. Designing surrogates for digital videos

Efficient surrogates that are easy to understand and that accurately represent primary information depend on several design parameters. The most important are a surrogate's level of abstraction and level of compaction, both briefly discussed above. Two additional design parameters – medium of representation and origination – are discussed here. First, a surrogate designer must choose
the medium of representation. The medium will determine whether a surrogate is static or dynamic (temporally dependent), how it
can be experienced by humans, and how costly it is to engineer and deliver. The medium of text is most commonly used because
human language offers great richness for describing what we know and because so much of our primary information has been
traditionally represented as text. In addition, efficient and effective tools and techniques for text surrogates have evolved over
hundreds of years. Thus, words (keywords, titles, abstracts, etc.) are traditionally used as surrogates for not just textual primary
objects, but also for photographs, recordings, and film (e.g., Borko & Bernier, 1975; Burke, 1999; Rafferty & Hidderley, 2005).
Because it is logical to use the same medium for a surrogate as is used in the primary information object, textual surrogates for
digital videos are often augmented with surrogates in other media. Digital media have greatly expanded these surrogation efforts in
general (Jorgensen, 2003) and there has been considerable attention to visual surrogates for video and film. O'Connor (1985) was an
early advocate of new kinds of surrogates for film, defining keyframes over thirty years ago. Just as keywords might be used to
represent textual documents, keyframes may be used as a basis for non-linguistic representations of video objects, and much effort has
focused on identifying and extracting keyframes. Rorvig (1993) demonstrated the feasibility of creating visual surrogates based on
extracted keyframes, and the creation of specific summaries has become an important aspect of the information retrieval research
community's interest in multimedia (e.g., Smeaton, 2001).
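The extraction idea can be sketched minimally. The toy function below (our illustration, not any particular system's algorithm) declares a shot boundary wherever consecutive frames' intensity histograms diverge beyond a threshold, then keeps each shot's middle frame as its keyframe; the bin count and threshold are illustrative assumptions:

```python
import numpy as np

def frame_histogram(frame, bins=16):
    """Normalized grayscale intensity histogram for one frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def select_keyframes(frames, threshold=0.5):
    """Pick one keyframe index per detected shot.

    A shot boundary is declared where the L1 distance between
    consecutive frame histograms exceeds `threshold`; the keyframe
    for each shot is the index of its middle frame.
    """
    hists = [frame_histogram(f) for f in frames]
    boundaries = [0]
    for i in range(1, len(hists)):
        if np.abs(hists[i] - hists[i - 1]).sum() > threshold:
            boundaries.append(i)
    boundaries.append(len(frames))
    # Middle frame of each [start, end) shot span.
    return [(start + end - 1) // 2
            for start, end in zip(boundaries, boundaries[1:])]

# Two synthetic "shots": 10 dark frames followed by 10 bright frames.
dark = [np.full((8, 8), 10, dtype=np.uint8)] * 10
bright = [np.full((8, 8), 200, dtype=np.uint8)] * 10
print(select_keyframes(dark + bright))  # -> [4, 14]
```

Production systems such as Informedia combine many more features (color, texture, faces, transcripts), but the same select-representative-frames-per-shot structure underlies most keyframe-based surrogates.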
A second parameter in surrogate design is origination, which refers to whether the surrogate is extracted from the object (e.g., a
keyframe) or originally constructed (e.g., a customized trailer for a film). This facet is particularly important for automatic surro-
gation efforts. A classic distinction in text indexing is exemplified by keywords and subject headings. The former typically are selected
from the original text and the latter are selected from a controlled vocabulary (e.g., ACM classification, the gene ontology, etc.). The
automatic extraction of single words or short phrases has been found to be quite effective for creating meaningful surrogates. For
example, Jones, Jones, and Deo (2004) demonstrated that automatically generated keyphrases were as effective as full document
titles when using a PDA display, and Church, Smyth, and Keane (2006) demonstrated that keywords are more effective in small
displays than typical search engine snippets.
While surrogates for digital video are typically visual, skims and trailers use both visual and audio modalities, and surrogates such
as spoken keywords are strictly audio. The dominance of visual surrogates is mainly due to the fact that most of the video retrieval
research and development efforts over the past few decades focused on visual features (e.g., luminosity, color, texture, shapes) of
video, and the metadata and resulting surrogates were constructed to take advantage of these features. The many years of TREC Video
results9 readily demonstrate the importance of linguistic data (in text format) for retrieval; nonetheless, several studies (reviewed in
Frøkjær et al., 2000) have demonstrated that people like to have visual surrogates regardless of their performance effects. Thus, it
seems likely that surrogates for digital video will continue to incorporate multiple modalities: textual, visual, and audio. Results from
the Informedia project (Hauptmann, 2005; Smith & Kanade, 1998), the CueVideo Project (Amir et al., 2000) and the Open Video
project (Wildemuth et al., 2002, 2003) all suggest that combining multiple surrogates of different modalities is effective for helping
users retrieve and make sense of the primary information objects.
Creating user interfaces that support interactive search and browse capabilities in digital video collections has also received
attention. The Informedia project (Christel, 2008; Hauptmann, 2005) is perhaps the most comprehensive digital video effort that
includes both novel user interfaces and usability testing. Their video skims are surrogates created from several kinds of features
(transcripts, keyframes extracted with color and texture features, superimpositions, and other features such as face recognition). The
Físchlár Project (Smeaton et al., 2001) stores and provides access to video programming from broadcast TV. They have developed
user interfaces that integrate several different types of surrogates to help users find video (Lee & Smeaton, 2002). The Open Video
project developed and investigated the usability of several surrogates and made them available through a Web-based interface
(Marchionini, Wildemuth, & Geisler, 2006). The CueVideo system (Amir et al., 2000; Srinivasan, Ponceleon, Amir, & Petkovic, 1999)
extracts a variety of features as the basis for indexing (e.g., using speech to text analysis, image analysis, event analysis) and has been
the basis for more specific user interface techniques that provide visual patterns for where query images occur in lists of video

9 http://trecvid.nist.gov/.


segments. The SmartSkip interface (Drucker, Glatzer, DeMar, & Wong, 2002) is one of the few interfaces that provide innovative fast
forwards beyond the digital TV fast forwards. Their user study compared a standard skip interface and a fast forward interface with a
user-controllable SmartSkip interface. They found that, although people found the SmartSkip interface more ‘fun’ to use, they per-
formed better with the standard skip interface than with the other two interfaces on commercial-skipping and weather-finding tasks.
These results parallel studies of slide shows and storyboards (Tse, Marchionini, Ding, Slaughter, & Komlodi, 1998) that demonstrate
that, although people are able to perform effectively on retrieval tasks with very rapid slide shows, they strongly prefer the story-
board interfaces that give them more control but take more time to use.
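The control-versus-time tradeoff between these two surrogate styles can be seen in how each is constructed. The sketch below (our simplification; the frame rate, speedup factor, and storyboard size are illustrative assumptions) builds both from the same frame sequence:

```python
def fast_forward(frames, speedup=32):
    """Fast-forward surrogate: keep every `speedup`-th frame,
    preserving temporal order but fixing the viewing pace."""
    return frames[::speedup]

def storyboard(frames, n_keyframes=9):
    """Storyboard surrogate: n evenly spaced frames shown all at
    once, so the viewer controls how long to inspect each one."""
    step = len(frames) / n_keyframes
    return [frames[int(i * step)] for i in range(n_keyframes)]

# A 60-second clip at 30 fps, with frames stood in for by their indices.
frames = list(range(60 * 30))
print(len(fast_forward(frames)))  # 57 frames, ~1.9 s of playback at 30 fps
print(len(storyboard(frames)))    # 9 frames, inspected at the user's own pace
```

The fast forward compresses viewing time but dictates its pace, while the storyboard trades screen space for user control, mirroring the preference findings of Tse et al. (1998).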

3. Studies of people's interactions with surrogates

Because surrogates for primary information objects have been in use since collections of those objects were gathered into libraries,
there are too many relevant studies to thoroughly review here. Therefore, we'll focus our attention on key empirical studies of people
making relevance judgments about books and journal articles, based on textual surrogates; of people making relevance judgments
about Web pages, based on snippets on search engine results pages (SERPs); and of people's interactions with digital videos and their
surrogates.

3.1. Making relevance judgments about books and journal articles

In the 1960's, as early information retrieval systems for the research literature were being designed, there was no consensus about
which metadata might serve most effectively as a surrogate for the full document. Thus, a number of studies (e.g., Kent et al., 1967;
Rath et al., 1961; Thompson, 1973) compared the relative usefulness of different surrogates. Several of these early studies, like the
current study, used the relevance judgment based on the full text of the document as the standard for the correctness of the relevance
judgment based on a particular surrogate. The underlying assumption is that the surrogate that can lead to the same relevance
judgment as the judgment made from the full text is the best surrogate. The outcome of these early studies was a focus on titles,
subject terms, and abstracts as surrogates for textual documents.
In the late 1980’s and into the 1990’s, interest in users’ understanding of relevance was renewed (e.g., Harter, 1992; Schamber,
Eisenberg, & Nilan, 1990; Sperber & Wilson, 1986). Along with this conceptual work, a number of empirical studies were conducted,
a few of which focused on the surrogates used to make relevance judgments. For example, Wang and Soergel (1998) examined the
“document information elements” used by people as they were making relevance judgments, and Barry (1998) examined the way
people used bibliographic information (i.e., descriptive cataloging), descriptive notes, the abstract, and the indexing terms to make a
decision about whether to further pursue (or not) a particular document. Of particular interest for the current study was Janes’ (1991)
study of the ways in which user relevance judgments change as they receive additional information about the document. Study
participants viewed the document title first, followed by two of three additional surrogates incrementally: additional bibliographic
information, abstracts, and indexing terms; the order was counterbalanced across groups of participants. The surrogates were
evaluated in terms of changes in judgments when additional information was received. (The current study uses a similar approach,
comparing the relevance judgments made based on a particular surrogate with the judgments made by the same user after viewing
the complete video.) Neither the additional bibliographic information nor the indexing terms resulted in changes in users’ judgments;
the addition of abstracts “produced the greatest number and magnitude of changes in users’ judgments” (p.642), corroborating the
older studies suggesting that abstracts are a useful surrogate.
It should be noted that the studies reviewed here focused on the quality of the relevance judgments made, based on the surrogates
provided. However, this is not the only criterion for success that might be important. As noted in the discussion of information
compaction above, the time required to interact with a surrogate should also be taken into account. For example, while an abstract
might provide stronger support for making relevance judgments than a title would, it will take the user quite a bit longer to read the
abstract than to read the title. A second criterion for success that was generally not examined in these studies is the user's affective
response to the surrogate or preference for a particular surrogate. In many settings, including interactions with surrogates, users often
prefer options that are not necessarily the most efficient or the best for performance. Thus, evaluations of surrogates
should take into account at least three things: user performance, efficiency, and user preferences. The current study does consider
these three types of outcomes.

3.2. Making relevance judgments about web pages

With the development of the World Wide Web in the early 1990s, the context of making relevance judgments changed in several
important ways. First, the full document was available for viewing almost immediately, in comparison to requiring a special trip to
the library as in earlier decades. Surrogates for Web pages were still needed, so that searchers could simultaneously view multiple
results from a search, but the burden of making an overly-optimistic assessment of a document's relevance was greatly reduced.
Second, the Web quickly included multimedia documents, opening up possibilities for image or moving image surrogates. Third, the
full text of a document could be manipulated in various ways, supporting the creation of innovative textual surrogates, beyond the
traditional snippet based on the meta description provided by the author/publisher, text extracted from the full document, or a
combination of these (Spencer, 2010).
While the relationship of the surrogates to users’ relevance judgments shifted with the introduction of the World Wide Web, it
continued to be true that “the quality of the document surrogate has a strong effect on the ability of the searcher to judge the


relevance of the document” (Hearst, 2009, p.120). For the purposes of the current study, evaluations of the visual/multimedia options
for surrogate creation are of primary interest, including thumbnails of the web page and other images extracted from the web page.
A thumbnail of a Web page is “a miniature image of a Web page” (Al Maqbali, Scholer, Thom, & Wu, 2010, p.37), and thumbnails are an expected surrogate in a digital video collection (Albertson & Ju, 2015). Thumbnails of pages previously seen can be quickly recognized by users (Robertson et al., 1998), but their value for searching and making relevance judgments on pages not previously
seen required investigation. Some studies compared text-only surrogates, thumbnail-only surrogates, and combinations. Both
Dziadosz and Chandrasekar (2002) and Aula, Khan, Guan, Fontes, and Hong (2010) found the combination to be most effective in
supporting accurate relevance judgments. Dziadosz and Chandrasekar found that the combination was particularly helpful in re-
ducing the number of incorrect not-relevant judgments, and Aula et al. supported that finding by demonstrating that, with just
thumbnails, people underestimated a page's relevance and, with text snippets only, people overestimated a page's relevance.
A number of researchers have examined enhanced thumbnails, sometimes called visual snippets. Woodruff, Rosenholtz, Morrison,
Faulring, and Pirolli (2002) enhanced the thumbnail by increasing the size of the key text on the page, so it could be read on the
thumbnail. Jiao, Yang, Xu, and Wu (2010) used a similar approach, but also added the logo from the page, if available. Al Maqbali
et al. (2010) also experimented with visual tags (i.e., word clouds) as surrogates. Results from comparisons of these various visual
surrogates were mixed; for example, sometimes the visual snippet was more efficient, but in other studies there was no advantage.
Additional studies of such visual surrogates may be worthwhile, particularly by extending them to collections of moving images and
digital video.
An alternative to the thumbnail is to use an image extracted from the Web page, in combination with a text snippet. The challenge
to this approach is, of course, to select the image that might be most useful as (part of) a surrogate. The various names used for these
images suggest the variety of approaches used to select them: salient images (Al Maqbali et al., 2010; Teevan et al., 2009); dominant
images (Jiao et al., 2010; Li, Shi, & Zhang, 2008), including images selected from other pages (Jiao et al., 2010); high-scent images
(Loumakis, Stumpf, & Grayson, 2011); or good images (Capra, Arguello, & Scholer, 2013). In most cases, the selected images were
used to augment existing textual surrogates, and the resulting composite was compared with other surrogates in terms of user
performance and/or preference. Again, the results were mixed. Teevan et al. (2009) found no difference in the time required to
complete searches, but participants clicked on the fewest results when using text snippets and most when using thumbnails, with the
enhanced visual snippets in between. Li et al. (2008) found the combination of text snippet and dominant image to be most efficient,
in terms of both time to complete the search and the number of clicks needed to complete the task. Loumakis et al. (2011) found no
difference in efficiency, but study participants did prefer the surrogates that included high-scent images. Capra et al. (2013) found
slight improvements in judgment accuracy when good images were added to the textual surrogate, but it took slightly longer to
complete the search tasks. Clearly, surrogates that incorporate images from the retrieved Web pages should be investigated further.

3.3. User interactions with digital videos and their surrogates

The temporal nature and multiple channels of video content exacerbate the need for surrogates that offer more representation
facets than text. While one study (Zhang & Li, 2008) found that study participants wanted to be able to view several textual
elements in order to make accurate relevance judgments about moving image materials, our focus here is on surrogates that are
oriented toward the visual and audio channels present in digital videos. Many options have been considered and designed, but only a
few options have been evaluated in terms of user interactions with them. Non-textual surrogates that have been evaluated include
those based on the static display of specific frames (key frames and storyboards), a moving image display of selected key frames (slide
shows and fast forwards), and various composite and multimodal surrogates (e.g., skims and other surrogates combining audio and
images). Findings related to each of these – alone, in combination, and in combination with textual surrogates – are summarized here.
These studies have benefitted from the methodological work conducted as part of the TRECVID workshops (summarized in Smeaton,
Over, & Kraaij, 2006; Awad et al., 2017, 2018); the VideOlympics (Snoek et al., 2008); the Video Browser Showdown, conducted as a
special session at the International Conference on MultiMedia Modeling (MMM) (described in Cobârzan et al., 2017; Schoeffmann,
2014; Lokoč et al., 2019); and the measures developed by Yang et al. (2003a,b).

3.3.1. Static display of selected keyframes


Many video representations or summarizations incorporate a static display of keyframes extracted from the original video (Hu,
Xie, Li, Zeng, & Maybank, 2011; O'Connor, 1985; Petrelli & Auld, 2008). In most video retrieval systems, one keyframe is selected per
shot, defined as “a single continuous camera operation without an editor's cut, fade or dissolve” (Christel, 2008, p.22). The shot's
keyframe is selected to be representative of the frames in the shot; that selection process may be based on shape and color analysis
(Schoeffmann, Hopfgartner, Marques, Boeszoermenyi, & Jose, 2010); timing, e.g., the first frame in the shot or the frame at the
midpoint of the shot; or the presence of query terms in the transcript associated with the frame (Christel, 2006). Alternatively, a
keyframe may be selected to represent a scene, defined as “a group of contiguous shots that are coherent with a certain subject or
theme” (Hu et al., 2011, p.802), the entire video, or a sub-shot.
When a keyframe is intended to represent an entire video, it is usually called a poster frame and is often accompanied by textual
metadata. This type of surrogate was one of the earliest to be incorporated in video retrieval systems and is still in wide use in
commercial systems, such as YouTube, and mobile versions of systems (Chen, Wang, Liu, & Lu, 2015). Users report that poster frames
are helpful in making relevance judgments (Cunningham & Nichols, 2008). An example study of the use of poster frames was
conducted by Goodrum (1997, 2001), comparing poster frames with another keyframe-based surrogate (a set of five keyframes,
including the poster frame) and two text-only surrogates (title, keywords). These surrogates were evaluated in terms of the congruity
between similarity judgments based on the surrogates and similarity judgments based on viewing the full video. The two keyframe-
based surrogates outperformed the two textual surrogates. Taking a different approach, Hughes, Wilkens, Wildemuth, and
Marchionini (2003) conducted an eye-tracking study to see how people used a poster frame and textual metadata in combination.
Two interfaces were compared: one with the poster frame to the left of the text, and the other with the poster frame to the right of the
text. From their results, they concluded that people tended to make their initial relevance judgments based on the textual metadata,
confirming that judgment with the poster frames.
One of the alternative surrogates used by Goodrum was a storyboard consisting of five keyframes for each video. A storyboard is a
“grid-like visualization of keyframes” arranged in sequential order (Schoeffmann et al., 2010, p.10; Christel, 2008). An important
tradeoff in the design of a storyboard is between the number of keyframes to include and the size of those keyframes. Most studies
(e.g., Goodrum, 1997; Wildemuth et al., 2002) have found that user interactions with storyboards are both efficient and effective. As
noted by Christel (2008), “storyboard interfaces… consistently and overwhelmingly produced the best interactive search perfor-
mance” and they “are the most frequently employed interface in video libraries” (p.25), based on TRECVID results over several years.
This conclusion was confirmed more recently, as Westman (2010) found that storyboards supported user performance as well as more
dynamic surrogates (fast forwards and scene clips). Since the usefulness of storyboards has been established, researchers have gone
on to examine alternative ways to select the keyframes to be included in a storyboard (e.g., Jacob, Pádua, Lacerda, & Pereira, 2017),
the size and number of keyframes (Furini, Geraci, Montangero, & Pellegrini, 2010), the layout and scalability of the storyboard for
mobile devices (Herranz & Jiang, 2016; Low, Hentschel, Stober, Sack, & Nürnberger, 2017; Mei, Yang, Yang, & Hua, 2008;
Schoeffmann, Münzer, Primus, Kletz, & Leibetseder, 2018), color coding of the storyboard frames (Hürst, Ching, Schoeffmann, &
Primus, 2017), and the interactivity supported in the storyboard (Hürst & Klappe, 2017; discussed further in the later section on
dynamic displays).
A number of studies have evaluated the ways in which textual metadata might be combined with either a poster frame or a
storyboard. Ding, Soergel, and Marchionini (1999) compared storyboards of 12 keyframes, keywords (six per item), and the com-
bination of the two; while the keyword-only surrogates were most efficient to use, the combination was considered most useful
and was preferred by the users. Christel and Warmack (2001) compared five storyboard surrogates that varied in the way
textual metadata was added. One of the surrogates had no text, two included full transcripts, and two included a single line caption
for the storyboard; other variations were in layout. Their overall findings indicated that the combination storyboards were more
efficient and were preferred, with smaller differences across the variations of text length or placement.
A typical storyboard presents the selected keyframes in sequential order, i.e., in the order in which they appear in the original
video. However, alternative ways to organize or visualize keyframes have also been investigated. Assuming a single keyframe is
representative of a shot within the video, Divakaran, Forlines, Lanning, Shipman, and Wittenburg (2005) used both a Fisheye layout
and a Squeeze layout to increase navigation accuracy as users were fast-forwarding through a video. Instead of using a sequential
ordering, Fan, Elmagarmid, Zhu, Aref, and Wu (2004) and Snoek et al. (2009) clustered the keyframes by the similarity of their
content. Alternatively, Adcock, Cooper, Girgensohn, and Wilcox (2005) selected a small number of keyframes, based on their match
with the textual query, and arranged them in a collage. Researchers have also developed tree-like interfaces (e.g., Girgensohn,
Shipman, & Wilcox, 2011; Jansen, Heeren, & van Dijk, 2008; Taskiran et al., 2004) and self-organizing maps based on keyframes
(Bärecke, Kijak, Nürnberger, & Detyniecki, 2006). Taking a completely different approach, Nguyen, Niu, and Liu (2012) used volume-
rendering techniques to freeze a sequence of action into a 3-dimensional still.
Keyframes can also be used to represent different chunks of the full video. For example, it can easily be argued that, for news
video, the story is the most important unit. Story segmentation is a task that has been incorporated into the TRECVID evaluations, and
so advances have been made in methods for automatically identifying story boundaries. These methods have been incorporated in the
development of surrogates in several systems, such as the Físchlár-News system (Smeaton, Gurrin, & Lee, 2006) and a video table of
contents (Goeau, Thièvre, Viaud, & Pellerin, 2007). Starting from the users’ behaviors, Al-Hajri, Miller, Fels, and Fong (2013) seg-
mented the video based on the viewing history of those users; these segments were then represented in the interface with a story-
board, and found to be more effective than the traditional storyboard in terms of number of questions about the video that could be
answered correctly and the time needed to find the clips that would answer those questions.
As can be seen in the studies cited here, many different interfaces have been developed for the display of keyframes as surrogates
for video shots or other chunks. The one thing that they all have in common is the idea that a keyframe can serve as a “good”
surrogate for a particular portion of a video. The validity of this quality judgment has been borne out in research studies and with
current practices in digital video libraries. However, all of these approaches sacrifice the dynamic nature of video. To preserve that
attribute, some attempts have been made to develop surrogates that are dynamic.

3.3.2. Moving image display of specific frames


Both slide shows and fast forward surrogates have been developed and tested for their effectiveness in representing the original
video on which they are based. Both begin by selecting keyframes, just as is done with a static display. However, instead of the
selected keyframes being displayed as stills, they are “played” by the interface. Slide shows may include any number of keyframes,
and those may be displayed at any speed, either automatically or under user control.
Several user studies of slide shows were undertaken in the late 1990s and the following years. Ding et al. (1997) investigated the
speed of the slide show, allowing the user to control speed (from 1 to 16 keyframes per second). They concluded that, to achieve
sufficient accuracy in object identification and gist determination, slide shows should move no faster than 8–12 keyframes per
second. Comparing a 12-keyframe-per-second slide show with storyboards, Komlodi and Marchionini (1998) found some advantages
for storyboards, but most differences were not statistically significant. Tse et al.'s (1998) study comparing a 4-keyframe-per-second
slide show with a storyboard found that participants completed object recognition tasks more accurately with the storyboard and that
there was no performance difference on gist determination. While these studies reveal no strong reasons for implementing a slide
show surrogate, rather than a storyboard, a slide show was one of several surrogates included in the Físchlár digital video library
(Lee & Smeaton, 2002).
If the slides in a slide show are shown at high speeds, they become fast forward surrogates. While the slide show research
indicated that users can perform some tasks more effectively at slow display speeds, some other studies pushed display speeds much
higher with only minimal decreases in user performance on a variety of tasks. Srinivasan et al. (1999) incorporated fast video
playback as a surrogate, adaptively sampling keyframes in order to construct a surrogate that emphasized the high-motion sections of
the full video. Some initial studies of fast forward surrogates found that some users preferred them (Wildemuth et al., 2002) and that
they were able to use the people, objects, settings, and actions/events in the fast forwards to determine the gist of the full video (Yang
& Marchionini, 2005). Wildemuth et al. (2003) used fast forward surrogates created through uniform sampling of the video's key-
frames to evaluate user performance at four different speeds, up to 256 times normal speed. User performance decreased as the fast
forward speed increased; 64 times normal speed was optimal due to the steep drop in performance at 128 times normal speed.
Peker and Divakaran (2003) and Cheng, Luo, Chen, and Chu (2009) each developed fast forward surrogates that vary the playback
speed based on the motion activity in the original video. Most recently, Joshi, Kienzle, Toelle, Uyttendaele, and Cohen (2015) used a
time lapse metaphor to develop fast forward surrogates for a variety of video genres and platforms. Westman (2010) compared user
performance with two fast forwards (one at 16 times normal speed and one in which the speed was user-controlled) and two other
surrogates; while user performance with fast forwards was no better than with static summaries, users preferred them and appre-
ciated the ability to control the play speed.
An alternative visualization that relies on the sequence of the video is a timeline representation of it. The timeline is usually
represented as a horizontal bar with a slider that the user can manipulate to move to a particular point in the video; the user can then
play the video beginning at that point. Some current systems include buttons for skipping back or forward by 10 s (e.g., the American
Archive of Public Broadcasting and the PBS Video collection). Various approaches to the scalability of the slider (see Hürst, 2006;
Matejka, Grossman, & Fitzmaurice, 2013) have been examined, such as an elastic timeline (Higuchi, Yonetani, & Sato, 2017; Hürst,
Götz, & Jarvers, 2004). Other researchers have augmented the timeline with additional data such as the color distribution across
frames, motion in the original video, or other features of it (Del Fabro, Münzer, & Böszörmenyi, 2013; Moraveji, 2004; Schoeffmann,
Taschwer, & Boeszoermenyi, 2010) or the frequency of use of specific portions of the video (Haesen et al., 2013; He, Sanocki, Gupta,
& Grudin, 1999; Kim et al., 2014). Some current systems also include searchable transcripts that allow the user to easily skip to a
particular point in the video (e.g., ScienceCinema and the WGBH Open Vault). Finally, some work has been done on a
circular/clock layout for a timeline, rather than a horizontal bar (Haesen et al., 2013; Münzer & Schoeffmann, 2018).

3.3.3. Skims and other multimodal composites


The dynamic display of selected frames, as described above, takes advantage of the visual channel of video, but does not in-
corporate the audio channel. Approaches that do incorporate the audio channel are usually called video skims or multimodal sur-
rogates (Hu et al., 2011). Skims developed for use in the Informedia Digital Video Library at Carnegie Mellon University selected and
concatenated original video and audio data into new, shorter presentations (Smith & Kanade, 1998). The version that was found to be
most effective incorporated the best audio selections, synchronized with their video, in approximately 5-second segments. These
skims were effective in supporting both fact-finding and gist determination tasks (Christel, 2006; Christel et al., 1998). Further
development of skims includes both different methods for selecting the segments to include (e.g., Benini, Migliorati, and
Leonardi (2010) used motion and the presence of faces to identify relevant segments in surveillance videos) and playable storyboards
(Jackson et al., 2013) that give the user more control.
Other research teams also experimented with combining visual surrogates with audio. The CueVideo project (Srinivasan et al.,
1999) developed a motion storyboard. Keyframes were selected from the middle of each shot, and the audio track played while the
selected keyframes were displayed. The keyframe display rate was synchronized to the audio channel. The Open Video project
(Marchionini, Song, & Farrell, 2009; Song & Marchionini, 2007) also experimented with multimodal surrogates. One type of sur-
rogate played a spoken description of the video (included in the video metadata) while the fast forward surrogate was viewed; a
similar surrogate played a set of five spoken keywords (manually selected from those originally assigned to the video) while the fast
forward was viewed. This team also conducted experiments with spoken descriptions played during the display of storyboards. In all
these cases, the audio track was created with a text-to-speech synthesizer.
The efficiency of using both aural and visual channels should be an advantage of using multimodal surrogates, and the desire for
gaining efficiency has motivated experimentation with a variety of such surrogates. However, the empirical evaluations are mixed.
While some of these surrogates were able to support recognition, fact finding, and gist determination tasks, many users continued to
prefer reading text descriptions and keywords over hearing them. It is likely that the need for synchronization of the two channels is a
key factor affecting the effectiveness of multimodal surrogates. Thus, surrogates like skims or storyboards synchronized with the
audio track of the original video may still hold some potential for further development.

3.4. Summary

Surrogates for information objects have been used to represent those objects for centuries. Since the inception of computerized
information retrieval systems in the mid-20th century, surrogates for textual information objects have typically included author, title,
abstract, etc. Similar surrogates have also been used to represent other information objects, such as images and films/videos. Even
today, with the source material of digital video available for manipulation, YouTube and other online video collections use relatively
traditional surrogates: title, snippet, scrubbable timelines, popularity data, and a poster frame. However, opportunities for the de-
velopment of novel surrogates are available. These include moving image displays of various types, as well as multimodal surrogates
that include both visual and audio channels.
When evaluating the effectiveness of a particular surrogate, the goal of the interaction needs to be kept in mind. At times, a person
wants to use a surrogate to find a set of information objects for viewing, re-use or both; at other times, a person hopes that the
surrogate will help him or her to understand the essence of a particular information object. Both of these purposes are combined as a
person makes a relevance judgment about the information object based only on the surrogate. Thus, the agreement between a
relevance judgment based on examination of a surrogate and a relevance judgment based on examination of the full information
object is a strong indicator of the effectiveness of the surrogate.
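The agreement notion described above can be made concrete with a simple sketch. This is illustrative only, not the paper's exact analysis: one plain way to quantify a surrogate's effectiveness is the fraction of items on which the surrogate-based judgment and the full-video judgment match on the 4-point relevance scale; the judgment pairs below are hypothetical.

```python
# Illustrative sketch (not the paper's exact analysis): effectiveness of a
# surrogate as exact agreement between relevance judgments made from the
# surrogate and from the full video, on the study's 4-point scale.
# The judgment pairs below are hypothetical examples.

SCALE = ["Not Relevant", "Probably Not Relevant", "Probably Relevant", "Relevant"]

def percent_agreement(pairs):
    """Fraction of (surrogate_judgment, full_video_judgment) pairs that match exactly."""
    matches = sum(1 for surrogate, full_video in pairs if surrogate == full_video)
    return matches / len(pairs)

pairs = [
    ("Relevant", "Relevant"),                       # match
    ("Probably Relevant", "Relevant"),              # near miss
    ("Not Relevant", "Not Relevant"),               # match
    ("Probably Not Relevant", "Probably Relevant"), # miss
]
print(percent_agreement(pairs))  # → 0.5
```

A refinement would weight near misses on the ordinal scale less heavily than large disagreements (e.g., weighted kappa), since "Probably Relevant" vs. "Relevant" is a smaller error than "Not Relevant" vs. "Relevant".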

4. Research questions

This study exploited this line of reasoning. It examined user interactions with several multimedia surrogates of digital video
objects and compared their effectiveness in supporting those users to make relevance judgments that would match their judgments
made about the full video. In addition, the time required to interact with the surrogates and users’ affective reactions to those
interactions were analyzed. Specifically, the research questions addressed in this study were:

• Which of the surrogates for video objects was most effective in supporting accurate relevance judgments?
• Which surrogate required the most time to use?
• Which surrogate provided the most positive user experience?

5. Methods

A sample of potential users of digital video collections was recruited to interact with four systems, each incorporating a different
surrogate in the list of retrieved results: a storyboard, a 7-second segment of the video, a fast forward of the video, and a combination
of all three of these surrogates. The 36 participants were each asked to complete one search task on each of four different systems.
They judged the relevance of at least 10 items for each search based on only the surrogate(s) provided; after completing these
relevance judgments, they re-judged the relevance of two items for each search based on viewing the full video. For each system, they
then rated its perceived usefulness, perceived ease of use, and provided subjective assessments of their experience of flow (con-
centration and enjoyment) while using the system. Transaction logs captured the participants’ interactions with the four systems,
including their relevance judgments and subjective ratings. The study methods are described in more detail below.

5.1. Participants

Study participants were recruited from among students, faculty and staff at the University of North Carolina at Chapel Hill (UNC-
CH) in 2004. Personal invitations were extended to faculty and staff previously identified as having an interest in video collections. In
addition, students in classes using video or concerned with video production were invited to participate. These groups were targeted
in order to recruit a sample that was already familiar with and had experience working with digital video. Each participant was
offered $20 as an incentive. In total, 36 study participants completed the study protocol.

5.2. Video collection/database

The Open Video Project10 supports a digital video repository that can serve as an open source test bed for the research and
educational communities. The repository points to about 2000 video segments and draws upon documentaries from many US
government agencies, the Prelinger Collection in the Internet Archive, digitized films in the Library of Congress’ American Memory
collection, and videos from CMU's Informedia Project and the University of Maryland's Human Computer Interaction Laboratory. The
preponderance of the collection is documentaries and educational programming and, given the assigned search tasks (see below), it is
likely that most of the retrieved videos would be of these genres. The MySQL metadata database is accessible from an interface that
provides overviews and previews (Geisler, Marchionini, Nelson, Spinks, & Yang, 2001) and serves as the testbed for the surrogates
developed and tested in the Interaction Design Laboratory at UNC-Chapel Hill.
A subset of the Open Video repository was defined for this study. It was first constrained by video length: it included only videos
that are 10 min or less (excluding the 1627 videos in the collection that last longer than 10 min). The purpose of this constraint was to
prevent too great a burden on participants when they were asked to view the full video for some of the items they retrieved (eight
videos, total). In addition, the study database included only those videos for which all three surrogates were available. Restricted in
these two ways, the database to be searched included about 380 digital videos.
In order to make the searching context realistic, all the search features included in the Open Video website were available for use
by participants. These included a basic search function, a detailed search function (allowing the user to search specific fields and limit

10
https://open-video.org/.


Fig. 1. Screen image from an Open Video search for “water”.

the search results by genre, duration, format, color/black & white, sound or not, language, and creation date). The search results
could be sorted by relevance, title, year, duration, or popularity. The default results list in all four systems included some traditional
surrogates: a poster frame and several textual elements (title, brief description, genre, keywords, duration and number of downloads).
Fig. 1 shows a typical display during system use.
When a video was selected for further consideration, the detailed results page was displayed. It included one or more surrogates
(depending on the system), as well as full metadata. In addition, this display asked the study participant to rate the relevance of the
video on a four-point scale: Not Relevant, Probably Not Relevant, Probably Relevant, or Relevant.

5.3. Surrogates

Three different surrogates were compared in this study: a storyboard, a 7-second segment of the video, and a fast forward of the
video. Each of these surrogates was represented in one of the systems with which participants interacted. The fourth system provided
all three surrogates and each participant could use any or all of the surrogates when interacting with that system.
The storyboard surrogate displayed up to 30 individual keyframes from the video, laid out in a grid that was six columns wide and
up to five rows high. The frames to be included were selected through a combination of automatic identification of keyframes and
manual selection of a small number of keyframes from those identified. When the storyboard surrogate was evoked, the entire set of
keyframes was visible simultaneously. The 7-second segment was created by selecting the first seven seconds of the video, after the
titles had finished. When this surrogate was displayed, it played like the actual video, i.e., at normal speed,
and included the video's audio track. The fast forward surrogate was created by sampling every 64th frame from the full video. These
frames were then displayed at a rate of 30 per second. Given this compaction rate, it would take less than 10 s to view the fast forward
surrogate for a 10-minute video.
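The compaction arithmetic above can be sketched briefly. This assumes a nominal source frame rate of 30 frames per second (the text does not state the source rate explicitly; the 64:1 sampling and 30 fps playback are as described above).

```python
# Sketch of the fast-forward compaction arithmetic described above.
# Assumption: the source video runs at a nominal 30 frames per second;
# the surrogate samples every 64th frame and plays the samples at 30 fps.

def fast_forward_duration(video_seconds, source_fps=30, sample_every=64, playback_fps=30):
    """Duration (in seconds) of a fast-forward surrogate built by sampling
    every `sample_every`-th frame and playing the samples at `playback_fps`."""
    total_frames = video_seconds * source_fps
    sampled_frames = total_frames / sample_every
    return sampled_frames / playback_fps

# A 10-minute (600 s) video yields 600 * 30 / 64 = 281.25 sampled frames,
# which play in 9.375 s at 30 fps -- consistent with the "less than 10 s"
# figure above.
print(fast_forward_duration(600))  # → 9.375
```

Note that when source and playback frame rates are equal, the surrogate duration is simply the original duration divided by the sampling interval (here, a 64× speed-up).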

5.4. Search tasks

Participants were asked to complete one search task for each of the four systems. The four search tasks varied on two dimensions:
visual/concrete vs. abstract/conceptual, and embedding task (find a video for viewing vs. find a video for production/reuse). The
tasks were:

■ Imagine that you are an emergency response officer in California and that you are developing an online tutorial on how to respond
to an earthquake. You would like to illustrate the tutorial with video showing the damage that can be caused by an earthquake.
(“earthquake”; visual/concrete; for production)
■ Imagine that you are a geography professor and are developing a presentation for your introductory class on the differing roles of
rivers. You'd like to show clips from recent videos (since 1990) of several different rivers. (“rivers”; visual/concrete; for viewing)


■ Imagine that you are a video enthusiast, having studied video production techniques since you were in your teens. You are
interested in creating a montage of a selection of the really early films from the Open Video collection that are most popular with
users of the site. (“old popular”; abstract/conceptual; for production)
■ Imagine that you are a history professor, teaching a course on the history of technology in the U.S. You want to find some footage
that illustrates America's growing obsession with cars/automobiles between 1930 and 1950. (“cars”; abstract/conceptual; for
viewing)

The first dimension on which the tasks varied was visual/concrete versus abstract/conceptual. The earthquake and rivers tasks
were considered concrete, since the video images would include concrete examples from real-world settings. The old popular and cars
tasks were considered more abstract. They asked for videos that represented a particular style of video production or a
particular theme, but it was not clear which particular images might be contained in the videos judged relevant.
The tasks also varied in terms of the underlying purpose of the search: whether the video was to be obtained for viewing or for
reuse in production of a new video. This aspect of the search task was conveyed through a description of the context of the use of the
retrieved videos. The two tasks focused on viewing the videos (rivers and cars) were both in educational settings, with a teacher
seeking videos that could be shown/viewed in the classroom. The other two tasks focused on production settings. In one (earthquake),
the video was to be incorporated in a tutorial; in the other (old popular), the videos were to be assembled in a montage.

5.5. Study procedures

After giving informed consent, each person participated in an individual session, consisting of several types of questionnaires and
interactions with four variations of a search system. A pre-session questionnaire asked for some basic information about the parti-
cipant and his or her experience with video and video collections. It included several demographic questions (age, sex, status, and
department), questions on experience with computers, and questions on experience with video use and searching for videos. Next,
each participant was given a brief training tutorial on how to search the Open Video collection. The training tutorial was provided by
a member of the research team and instructed the participant in how to enter basic and detailed searches, modify those searches, and
browse the collection.
The next phase of the research protocol included interactions with the search systems. Each participant searched on all four tasks,
one task on each system variation. All items retrieved were shown in a summary list of results. Each participant was asked to view the
details on any item considered potentially relevant (accomplished by clicking on the item's title), for at least 10 items for each search.
The detailed view included one or more surrogates for the item (see the next paragraph). Based on viewing the surrogate, the
participant was asked to rate the item as Relevant, Probably Relevant, Probably Not Relevant, or Not Relevant.
Each of the four assigned searches was addressed to a different variation of the Open Video site. Each variation contained a
surrogate of a particular type. In other words, one contained only fast forward surrogates; one contained only storyboards; one
contained only 7-second segments; and one contained all three types of surrogates. The version that contained all three types of
surrogates was always assigned first; the order in which the other three variations were presented was counter-balanced. Immediately
after each search, the user completed questionnaires on perceived usefulness and perceived ease of use (Davis, 1989), and flow
(enjoyment and concentration) (Ghani et al., 1991). These questionnaires have been used and validated in prior studies.
Next, the participant made a second relevance judgment about each of two videos selected from those already rated. This time,
the relevance judgments were based on viewing the full video. The two videos to be judged were randomly selected from those
originally rated as either Probably Relevant or Probably Not Relevant to the search task.
After completing all four searches and their accompanying questionnaires and additional relevance judgments, the participants
completed a brief post-session questionnaire, asking them to compare the system variations (i.e., different surrogates) with which
they interacted.
All questionnaire responses were automatically captured and transferred to a database. The search transactions and relevance
judgments, as well as the time taken to make the various relevance judgments with each surrogate, were captured automatically. In
the system in which all three surrogates were available, the logs captured which surrogate(s) were viewed.

5.6. Data analysis

The primary research question relates to the relative effectiveness of the three different surrogates, in terms of (1) their ability to
support users in making relevance judgments (accuracy of those judgments and the time required to make them) and (2) users’
affective responses to them (perceived usefulness, perceived ease of use, flow, satisfaction). Analysis of variance, with appropriate
post hoc tests, was used to evaluate differences on these dimensions when the data would support such analyses; otherwise, chi
square tests on contingency tables were used to compare the effects of the surrogates.
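As an illustration of these two analysis strategies, the sketch below applies a one-way ANOVA to judgment times grouped by surrogate and a chi-square test of independence to a surrogate-by-outcome contingency table. Python with SciPy is assumed, and all values are hypothetical rather than the study data.

```python
from scipy import stats

# Hypothetical judgment times (in seconds), grouped by surrogate type.
storyboard = [9.8, 12.4, 10.1, 11.9, 8.7]
fast_forward = [16.2, 18.9, 15.4, 17.8, 16.6]
segment_7s = [15.1, 17.3, 16.8, 14.9, 16.0]

# One-way ANOVA: do mean judgment times differ across surrogate types?
f_stat, p_anova = stats.f_oneway(storyboard, fast_forward, segment_7s)

# Chi-square test of independence on a surrogate-by-outcome contingency
# table (rows: three surrogates; columns: three categories of judgment shift).
table = [[5, 15, 10],
         [4, 26, 12],
         [6, 25, 11]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")
print(f"Chi-square: chi2 = {chi2:.2f}, df = {dof}, p = {p_chi2:.3f}")
```

With real data, the grouped times would come from the transaction logs and the contingency table from the coded relevance judgments.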
Because the research design attempted to balance a desire for naturalistic interactions with the surrogates against control over variation in those interactions, participant behaviors were not completely controlled. While participants had access to only one
surrogate on three of the systems used, they could use any or all of the surrogates on one of the systems. In addition, they could also
view the textual metadata and a poster frame on any of the four systems. In the data analyses comparing surrogates, the use of the
multiple surrogates to make a particular relevance judgment was coded as “multiple”. If the participant did not view any of the
project-developed surrogates, using only the textual metadata and poster frame, the surrogate was coded as “text and poster frame”.
The data analyses used different units of analysis for different evaluation criteria. For the analysis of relevance judgment accuracy,


the data set consists of those surrogate-based relevance judgments for which the same user had made a “final” relevance judgment
(i.e., a judgment based on viewing the full video) for the same video in relation to the same task. This data set consisted of 275 pairs
of judgments.11 The same data set was used to analyze the amount of time taken to make the initial relevance judgments. In addition,
the set of all initial relevance judgments made with a single surrogate was analyzed for efficiency; this data set consisted of 1457
initial judgments.
To analyze the users’ affective responses to the surrogates, each user's rating of each surrogate was the unit of analysis. Each of the
36 study participants rated each of the four system versions (each of the three surrogates alone, plus the three in combination),
resulting in 144 ratings. The final questionnaire also asked participants to provide comparative evaluations of the surrogates; there
were 36 of these evaluations.

5.7. Limitations of the methods

In designing this study, we attempted to balance the desire to control the independent variables with the desire to make participation a naturalistic experience. This attempt at balance necessarily involved some compromises. In order to make the searches as
realistic as possible, we assigned simulated task scenarios, as suggested by Borlund (2003); even so, they were assigned tasks, rather
than tasks generated by the study participants. We allowed the user to have access to the full capabilities of the Open Video repository, in order to make the search interactions as natural as possible; however, this meant that they would be interacting with the
textual descriptions and poster frames for the video objects, as well as the multimedia surrogates. In addition, in the system version
that incorporated all three of the multimedia surrogates, the participants could choose to view one, two, or three of them, and could
view any of the surrogates multiple times. While these compromises in the study design were difficult to make, we believe that we
were successful in finding a balance between our desire to observe “natural” behaviors and our desire to control the study variables.

6. Results

6.1. Characteristics of the study participants

Responses to an initial demographic questionnaire showed that the participants included 19 women and 17 men, and that their
mean age was 25.7 years, ranging from 18 to 58. Seventeen of them were graduate students and 19 were undergraduate students.
They were affiliated with 17 different departments or schools; the most frequently represented fields were journalism and mass
communication (8), information and library science (6), business (4), and political science (3). All participants reported using
computers daily. They were also asked about their use of and searching for videos (see Table 1). Most participants (75%) watch
videos daily or weekly. However, they search for videos less frequently. When they do search for videos, 31 reported going online, ten use a film archive, six use newspapers or magazines, and six use other means of searching. They search by title (28), topic (11), author or actor (9), trailer (4), or other data elements (1).

6.2. Accuracy of initial relevance judgments

The initial relevance judgments based on only the surrogates were compared to the “final” judgments based on viewing the full
video. Thus, for this analysis, 275 pairs of judgments were considered. If the pair of judgments matched, the initial judgment was
evaluated as accurate; the larger the difference between the initial and final judgments, the less accurate the initial judgment. Table 2
shows the shifts in judgment between the initial judgments and the final judgments.12 Negative values indicate a shift in which the
final judgment was more negative than the initial judgment; positive values indicate a shift toward considering the video more
relevant for the task. For example, if the initial judgment was Probably Relevant and the final judgment was Relevant, the value indicated in Table 2 would be +1.
For 81 (30%) of the judgments, the initial relevance judgment matched the final judgment (based on viewing the full video). For
For 156 (57%) of the judgments, the final judgment was one point different from the initial judgment; of these, 44% of the final judgments considered the video to be less relevant than initially judged and 56% considered the video more relevant than initially judged.
Of the remaining judgments, 37 (13%) were rated two points different at the final judgment and one was rated three points different
(initially judged Not Relevant and judged Relevant based on the full video). Of these 38, 55% shifted toward considering the video
less relevant and 45% shifted toward considering the video more relevant. A chi-square analysis of these frequencies did not reveal
any statistically significant effect of the surrogate on initial judgment accuracy (chi-square(16) = 9.257, p = 0.902). There was also
no statistically significant effect of the surrogate viewed on the absolute size of the shifts made (F(4) = 0.541, p = 0.706).
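The shift coding that underlies this analysis can be expressed compactly; the sketch below (Python; fabricated judgment pairs, not the study data) maps the four-point relevance scale to integers and tallies the differences between initial and final judgments:

```python
from collections import Counter

# Four-point relevance scale mapped to integers.
SCALE = {"Not Relevant": 0, "Probably Not Relevant": 1,
         "Probably Relevant": 2, "Relevant": 3}

# Fabricated (initial, final) judgment pairs.
pairs = [("Probably Relevant", "Relevant"),
         ("Probably Not Relevant", "Not Relevant"),
         ("Relevant", "Relevant"),
         ("Not Relevant", "Probably Relevant")]

# Positive shift: the full video was judged more relevant than the surrogate
# suggested; negative shift: less relevant; zero: the judgments matched.
shifts = [SCALE[final] - SCALE[initial] for initial, final in pairs]
distribution = Counter(shifts)  # counts of each shift value, as in Table 2

print(distribution)
```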
An alternative way to evaluate the accuracy of the initial relevance judgments is to examine whether the final judgment confirmed the initial judgment (e.g., an initial judgment of Probably Relevant paired with a final judgment of Relevant) or challenged the initial judgment (e.g., an initial judgment of Probably Relevant paired with a final judgment of Probably Not Relevant). The 194 shifts between the initial and final relevance judgments were fairly evenly split between those changes that confirmed the initial judgment and those that challenged the initial judgment (see Table 3). A chi-square analysis indicated that the effect of surrogate type on whether a shift confirmed or challenged the initial judgment was not statistically significant (chi-square(8) = 3.265, p = 0.917).

11. In 28 cases, only the textual metadata and poster frame were viewed prior to making an initial relevance judgment. Thus, in the comparisons of initial judgment accuracy and time taken to make an initial judgment, an additional surrogate type (“text and poster frame”) was included in the analysis. There were 42 cases in which multiple surrogates were viewed; these were included in the analysis as a single “type” of surrogate.
12. One initial judgment of Not Relevant made with text and poster frame only was changed to Relevant after viewing the full video. Thus, the shift was +3. For all statistical analyses, this case was grouped with the +2 shifts.

Table 1
Experience with video (n = 36 participants).

                                               Never   Occasionally   Monthly   Weekly   Daily
How often do you watch videos or films?            0              3         6       23       4
How often do you search for videos or films?       2             16         5       12       1

Table 2
Accuracy of initial relevance judgments, by surrogate.

Surrogate               Differences between initial and final judgments
                          −2    −1     0    +1   +2 (a)   Total
Text and poster frame      1     6     5     5        3      20
Storyboard                 5    13    15    23        3      59
Fast forward               4    19    26    26        4      79
7-second segment           6    17    25    22        5      75
Multiple surrogates        5    13    10    12        2      42
Total                     21    68    81    88       17     275

(a) One shift of +3 is included in this column; that initial judgment was based on viewing only textual metadata and a poster frame.
Additional analysis, based on the amount of shift after using either one or multiple surrogates for the initial rating, indicated that
the effect of number of surrogates on the shifts made was not statistically significant (F(1) = 1.961, p = 0.163). There was also no
statistically significant effect of the number of surrogates viewed on the absolute size of the shifts made (F(1) = 0.960, p = 0.328).
ANOVA was also used to investigate the effects of the assigned tasks on initial judgment accuracy. Each of the four tasks had been categorized as concrete/visual or abstract/conceptual, and their purposes were categorized as for viewing or for production. There were no statistically significant differences in judgment accuracy associated with the concrete/abstract dimension of the tasks (F(1) = 2.442, p = 0.119), the purpose of the task (F(1) = 0.387, p = 0.534), or the interaction between these two aspects of the tasks (F(1) = 0.759, p = 0.384). There were also no statistically significant differences in the absolute values of the shifts associated with the concrete/abstract dimension of the tasks (F(1) = 0.857, p = 0.355), the purpose of the task (F(1) = 0.073, p = 0.788), or the interaction between these two aspects of the tasks (F(1) = 1.879, p = 0.172).
The possibility that individual differences across the participants might have affected the accuracy of their judgments was also
investigated. This effect was found to be statistically significant (F(35) = 1.516, p = 0.038). A post hoc analysis indicated a high
degree of overlap across the participants. However, four groups can be distinguished. Keeping in mind that each participant made
eight pairs of initial and final relevance judgments, we found that, for ten of the participants, the average shift was negative (ranging
from −1.13 to −0.12); for four participants, the average shift was zero (i.e., the negative and positive shifts balanced each other); for
ten participants, the average shift was positive but small (ranging from 0.11 to 0.17); and for the remaining twelve participants, the
average shift was positive and larger (ranging from 0.25 to 0.88). Further investigation of the effect of individual differences on initial
relevance judgments may be warranted.

Table 3
Comparison of confirming versus challenging shifts in relevance judgments.

Surrogate               Confirming   Matching    Challenging      Total
Text and poster frame     8 (40%)     5 (25%)       7 (35%)    20 (100%)
Storyboard               26 (44%)    15 (25%)      18 (31%)    59 (100%)
Fast forward             29 (37%)    26 (33%)      24 (30%)    79 (100%)
7-second segment         31 (41%)    25 (33%)      19 (25%)    75 (100%)
Multiple surrogates      17 (41%)    10 (24%)      15 (36%)    42 (100%)
Total                   111 (40%)    81 (30%)      83 (30%)   275 (100%)

Note: Shifts confirming the initial judgment include those from Probably Relevant to Relevant or from Probably Not Relevant to Not Relevant. All other shifts (excluding cases where the initial judgment matched the final judgment) were considered challenges to the initial judgment.

6.3. Time required to make initial relevance judgments

Even if the judgments made with each surrogate were equally accurate, a surrogate might still be judged superior if those judgments could be made more quickly. Therefore, the time required to make initial relevance judgments with each surrogate was evaluated.

Table 4
Time required to make initial relevance judgments, by surrogate, for cases in which final judgments were also made.

Surrogate                 N   Mean (in seconds)   Standard deviation
Text and poster frame    20                 4.9                  3.1
Storyboard               59                11.1                  6.8
Fast forward             79                17.0                 10.4
7-second segment         75                16.3                  8.7
Multiple surrogates      42                30.6                 14.7
Total                   275                16.8                 11.9
The first analysis was conducted using the same 275 judgment pairs discussed in the previous section. The mean times for making
the initial judgments with each surrogate are shown in Table 4. There is a statistically significant difference between surrogates, in terms of their efficiency (F(4) = 33.521, p < 0.001). Post hoc analyses (Bonferroni t-tests) indicated that the surrogates fell into four
groups, in terms of the amount of time that people interacted with them. The use of only text and poster frame was quickest; use of
the storyboard was next; the fast forward and 7-second segment surrogates were next; and participants spent the most time when
viewing multiple surrogates. These findings are not surprising, given the relative size and required play time of the surrogates.
A parallel analysis was also conducted on the full set of 1457 initial relevance judgments made by the 36 study participants. The results of this analysis are shown in Table 5. There is a statistically significant difference between surrogates, in terms of their efficiency (F(4) = 108.830, p < 0.001). Post hoc analyses (Bonferroni t-tests) revealed the same pattern seen with the smaller sample
of paired judgments presented above.
The effects of the tasks on the amount of time spent interacting with the surrogates were also evaluated (see Table 6). The main effect of the concreteness of the task was not statistically significant (F(1) = 0.529, p = 0.467), nor was the interaction effect (F(2) = 1.344, p = 0.247). However, the main effect of the task purpose was statistically significant (F(1) = 5.544, p = 0.019). Post hoc analyses (Bonferroni t-tests) indicated that the time taken to make relevance judgments associated with selecting videos for viewing (17.2 s) was greater than the time taken for selecting videos for production purposes (15.4 s).

6.4. Relationship between accuracy and time

The relationship between accuracy and time was examined through analysis of variance, comparing the mean time taken at each level of accuracy (i.e., the size of the shift in judgments, ranging from −2 to +2). Overall, the time spent making the initial judgment was not related to its accuracy (F(4) = 1.863, p = 0.117; see Table 7). Comparable results were obtained when the effects of the absolute values of the shifts on the time spent were analyzed (F(2) = 0.792, p = 0.454).

6.5. User perceptions of the surrogates

As described above, the study participants were asked about their perceptions of each of the four surrogates, as represented in the
four systems they used during the study protocol (storyboard only, fast forward only, 7-second segment only, and all three surrogate
types available). They rated the systems in terms of perceived usefulness, perceived ease of use, and two measures of flow: enjoyment
and concentration. Thus, there were 36 ratings of each of the four system versions on each of four attributes. Before examining the
data further, the reliability of each of the measures was investigated. The scales for perceived ease of use and perceived usefulness
each consisted of six ratings; the internal consistency (Cronbach's alpha) for these two measures was 0.88 and 0.96, respectively. The
measures of enjoyment and concentration each consisted of four ratings; the internal consistency (Cronbach's alpha) for these two
measures was 0.95 and 0.93, respectively. Thus, the analysis could proceed with confidence in the reliability of these measures of
user perceptions.
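Cronbach's alpha can be computed directly from the item-level responses using the standard formula α = (k/(k−1))(1 − Σσ²_item/σ²_total). A minimal sketch follows (NumPy assumed; the ratings are fabricated for illustration, not the study data):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array with one row per respondent, one column per scale item."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Fabricated ratings: 6 respondents x 4 items on a 7-point scale.
ratings = np.array([[2, 3, 2, 3],
                    [5, 5, 6, 5],
                    [1, 2, 1, 2],
                    [4, 4, 5, 4],
                    [6, 7, 6, 6],
                    [3, 3, 3, 4]])

print(f"alpha = {cronbach_alpha(ratings):.2f}")
```

Values above roughly 0.80 are conventionally taken to indicate acceptable internal consistency, which is the criterion the scale reliabilities reported here easily meet.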
Table 5
Time required to make initial relevance judgments, by surrogate, for all cases.

Surrogate                  N   Mean (in seconds)   Standard deviation
Text and poster frame    116                 6.8                  7.9
Storyboard               334                11.3                  8.8
Fast forward             408                15.1                 11.1
7-second segment         405                16.1                 10.3
Multiple surrogates      194                33.4                 25.7
Total                   1457                16.3                 15.0

Table 6
Time required to make initial relevance judgments, by task (n = 1457).

Task                                        N   Mean (in seconds)   Standard deviation
“earthquake” (concrete, for production)   360                15.5                 13.2
“rivers” (concrete, for viewing)          367                16.5                 17.8
“old popular” (abstract, for production)  372                15.2                 11.2
“cars” (abstract, for viewing)            358                18.0                 16.9
Viewing: “rivers,” “cars”                 725                17.2                 17.4
Production: “earthquake,” “old popular”   732                15.4                 12.2
Abstract: “old popular,” “cars”           730                16.6                 14.4
Concrete: “rivers,” “earthquake”          727                16.0                 15.7

Table 7
Time spent making initial relevance judgments, by amount of shift between initial and final judgments.

Amount of change in judgments     N   Mean (in seconds)   Standard deviation
Shift = −2                       21                23.3                 17.3
Shift = −1                       68                16.4                 11.3
Shift = 0 (no change)            81                16.4                 10.6
Shift = +1                       88                16.4                 11.8
Shift = +2 (a)                   17                14.2                 10.7
Total                           275                16.8                 11.9

(a) The one case where the shift was +3 is grouped with those cases where the shift was +2.

The mean ratings for perceived usefulness and perceived ease of use for each system are shown in Table 8. Note that, for these analyses, the object of interest is the assigned system rather than the surrogate(s) actually viewed; thus, there are no rows in the table for text and poster frame only or multiple surrogates. There were no statistically significant differences in these ratings (for perceived usefulness, F(3) = 0.430, p = 0.732; for perceived ease of use, F(3) = 0.242, p = 0.867). The ratings of enjoyment and concentration are also shown in Table 8; there were no statistically significant differences in the ratings for either (enjoyment, F(3) = 0.989, p = 0.400; concentration, F(3) = 1.799, p = 0.150).

Table 8
Participant ratings of perceived usefulness, perceived ease of use, enjoyment, and concentration.

                         Perceived usefulness   Perceived ease of use   Enjoyment       Concentration
Surrogate/system            Mean      s.d.         Mean      s.d.       Mean    s.d.    Mean    s.d.
Storyboard                   2.2       1.1          3.4       1.5        2.0     0.9     3.4     1.5
Fast forward                 2.1       1.0          3.0       1.4        1.9     0.8     2.7     1.3
7-second segment             2.4       0.9          3.0       1.2        2.1     0.8     3.1     1.0
All versions available       2.1       1.1          3.3       1.1        2.1     0.9     3.2     1.2

Note: Ratings on perceived usefulness and perceived ease of use were from 1 to 5. Ratings on enjoyment and concentration were from 1 to 7. For all four measures, lower ratings indicate more positive perceptions.
In addition to the ratings collected after using each of the four surrogates, a final questionnaire, administered after all four
surrogates had been used, also asked for direct comparisons of the surrogates. When asked, “How different did you find the systems
from one another?,” eight participants responded that they were completely different from each other, two that they were not at all
different, and 26 that they were somewhat different from each other. The participants were also asked which of the four systems was
easiest to learn to use, easiest to use, and best overall. Their responses are shown in Table 9.
Table 9
Participant comparisons of the four surrogates.

Surrogate/system version       Easiest to learn to use   Easiest to use   Best overall
Storyboard                                           1                4              1
Fast forward                                         2                7             16
7-second segment                                     3                4              3
Combination of 3 surrogates                         12               16             16
No difference                                       18                5              0

Note: The number in each cell represents the number of study participants selecting that surrogate/system in response to each question.

When asked, “Which of the four systems did you find easier to learn to use?,” the responses were not evenly distributed (one-sample chi-square, p < 0.001). Fully half of the respondents found no difference between the surrogates, in terms of how difficult it was to learn to use them, and a third of the respondents found the multiple-surrogate system easiest to learn to use. When asked, “Which of the four systems did you find easier to use?,” almost half of the respondents chose the multiple-surrogate system, a statistically significant effect (one-sample chi-square, p = 0.006). The remainder of the respondents were spread fairly evenly among the other choices, including that there was no difference between the systems. Finally, when asked, “Which of the four systems did you like the best overall?,” two of the systems were favored (one-sample chi-square, p < 0.001): the system including only the fast forward surrogate and the system including all three surrogates.
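A one-sample (goodness-of-fit) chi-square test of this kind compares the observed vote counts against a uniform expectation. The sketch below (SciPy assumed) applies the test to the “best overall” counts reported in Table 9:

```python
from scipy.stats import chisquare

# "Best overall" votes from Table 9: storyboard, fast forward, 7-second
# segment, combination of 3 surrogates, no difference (n = 36 participants).
observed = [1, 16, 3, 16, 0]

# Null hypothesis: votes are spread evenly across the five response options;
# with no expected frequencies given, chisquare() assumes the uniform mean.
stat, p = chisquare(observed)

print(f"chi2 = {stat:.2f}, p = {p:.5f}")
```

The very small p-value is consistent with the reported result that preferences were concentrated on the fast forward and combination systems.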
In the final questionnaire, participants were also asked three open-ended questions:

• The search results were displayed with both text and images. Which aspect of the display was most useful to you, and why?
• What did you like about each of the systems?
• What did you dislike about each of the systems?
Their responses were consistent with the perceptions of the surrogates they had expressed quantitatively, and also provided some
of the reasons for their ratings. A few key points from these comments are summarized here. In general, participants found the textual
metadata to be more useful than the image-based surrogates in making their relevance judgments. They noted that the text tells them
“what [the video] would be about” (P14) and “summarizes the main theme” of the video (P42). Specifically, one participant noted
that she could “match [her] query to the terms in the text to judge the [video's] relevance” (P31). Many of the participants mentioned
that they valued the text's ability to summarize the overall content or theme of a video. Even so, some participants pointed out that it
was quicker to review the image-based surrogates and “the pictures… make it easier to remember something about each film” (P38).
About one-fourth of the participants mentioned the complementarity of the textual and image-based surrogates, with a few indicating
that they scan the images first, then check for relevance by reading the text.
Participants also commented on what they liked and disliked about the individual surrogates being evaluated. The storyboard was appreciated because it “provided a good overview of the entire video” (P15). The structured layout of the storyboard also added value and was perceived by the participants as an “organized” way to view the images (P34). While many commented on the comprehensiveness of the storyboard, a few noted that the storyboard surrogates did not include enough images, the images were too small, and the still images did not provide a sense of the movement/motion in the video. Participants found the storyboard surrogate easy and quick to use, and particularly helpful for “looking for particular images” (P35).
Opinions about the 7-second segment were not as positive. While a few participants noted that it provided a good introduction to the video (e.g., P5), most believed that it does not provide sufficient information: it “didn't give enough info about what the film contained” (P35). Some participants were also concerned about the selection of the 7-second segment. They noted that a particular segment “could easily give a user the wrong idea about a video” (P43) and described it as “misleading” (P5) or “deceiving” (P3). One suggested remedying this situation by allowing a user to launch a 7-second segment from the starting point of any of the stills included in the storyboard (P12), an approach incorporated in the video skims created by the Informedia project. The primary advantages of the 7-second segment were its incorporation of the audio stream from the segment and its ability to give the user “an idea about the overall tone and feel of the video” (P13).
The fast forward surrogate was favored by almost half of the participants (as shown in Table 9, above); their comments about it reflected their positive attitudes toward it. They valued its ability “to summarize the whole film” (P3) and noted the way that it supported very efficient interactions with the video. As one participant noted, “I got to see every frame the film contained without having to waste 9 or 10 min by watching the whole thing. I could decide more completely whether this film would be relevant in a very short time” (P38). On the other hand, a number of participants found the speed of the fast forward surrogate to be too quick and would have liked to be able to control the play speed (e.g., P15) or “slow down the action” (P3). Participants also noted that this surrogate did not contain an audio channel.
The other preferred system was the one that included all three surrogates. As one person noted, “it improves the quality of the search when all three [surrogates] are used together” (P26). Several noted that use of multiple surrogates took longer, but valued “the opportunity to choose” which surrogate(s) to view (P7). Using this system, “there was much less guessing involved… It was usually relatively simple to see whether a video would be appropriate or not” (P21). A few also noted that their choice of surrogate depended on the task, providing specific examples of how their choice of surrogate was dependent on which task they were completing.
An additional indicator of user preferences for the different surrogates was available through the user interactions with those
surrogates when they were given a choice. In the system that included all three surrogates, a study participant could select one, two,
or three of those surrogates to view before making an initial judgment (or could avoid using any of them and rely only on the textual
descriptions and poster frame; these 28 cases were not included in this analysis). Participants’ choices were interpreted as indicators of their preferences, and are shown in Table 10. There were no statistically significant differences between the surrogates, in terms of how frequently they were selected by study participants (one-sample chi-square test, p = 0.078).

Table 10
Frequency of selection of each surrogate when all were available.

Surrogate          Number of times selected   Percent
Storyboard                              187     29.3%
Fast forward                            232     36.3%
7-second segment                        220     34.4%
Total                                   639    100.0%

Note: Of the 1457 initial relevance judgments made in the study, 368 were made with the system that provided access to all three surrogates. Of these, 194 involved the use of multiple surrogates; thus, the total for this table is greater than 368. Twenty-eight judgments relied only on textual metadata and a poster frame; they were not included in this analysis.

It was found that the search task did have a statistically significant effect (chi-square(6) = 30.039, p < 0.001) on the choice of surrogate used to make an initial relevance judgment, when all three surrogates were available. An examination of the standardized residuals indicates that participants selected the 7-second segment more often than expected and the fast forward surrogate less often than expected for the “old popular” task. In addition, the type of task had an effect, both for the viewing vs. production types (chi-square(2) = 6.472, p = 0.039) and the abstract vs. concrete types (chi-square(2) = 17.976, p < 0.001). An examination of the standardized residuals indicated that participants selected the 7-second segment more often than expected for the production tasks and less often than expected for the viewing tasks. A similar analysis indicated that participants selected the 7-second segment more often than expected for the abstract tasks and less often than expected for the concrete tasks. The selection frequencies for each task and for the types of tasks are shown in Table 11.

Table 11
Effects of task on surrogate selection.

Task                                      Storyboard   Fast forward   7-second segment
“earthquake”                                      52             77                 55
“rivers”                                          68             80                 53
“old popular”                                     32             27                 67
“cars”                                            35             48                 45
Viewing: “rivers,” “cars”                        103            128                 98
Production: “earthquake,” “old popular”           84            104                122
Abstract: “old popular,” “cars”                   67             75                112
Concrete: “rivers,” “earthquake”                 120            157                108

Note: The number in each cell represents the number of times a particular surrogate was selected in response to a particular task or type of task.
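The standardized residuals examined above can be derived from the same chi-square machinery; the sketch below (SciPy/NumPy assumed) computes Pearson residuals for the task-by-surrogate counts in Table 11:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Task-by-surrogate selection counts from Table 11
# (columns: storyboard, fast forward, 7-second segment).
observed = np.array([[52, 77, 55],    # "earthquake"
                     [68, 80, 53],    # "rivers"
                     [32, 27, 67],    # "old popular"
                     [35, 48, 45]])   # "cars"

chi2, p, dof, expected = chi2_contingency(observed)

# Pearson (standardized) residuals: cells with |residual| > 2 indicate
# surrogate choices occurring notably more or less often than expected.
residuals = (observed - expected) / np.sqrt(expected)

print(f"chi2({dof}) = {chi2:.3f}, p = {p:.5f}")
print(np.round(residuals, 2))
```

Running this reproduces the reported chi-square(6) = 30.039 and flags the “old popular” cells (7-second segment high, fast forward low) as the main contributors to the effect.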

6.6. Effects of individual characteristics on interactions with surrogates

Results from the initial demographic questionnaire were used to investigate the possible effects of a participant's age, sex, student status, and experience with video viewing and searching on the accuracy of the initial relevance judgments, the time taken to make those judgments, and ratings of the surrogates and of the interactions with them. No effects on judgment accuracy, time taken, or user ratings of the surrogates were found. The analysis did identify two statistically significant effects on users’ selection of which surrogate to view when all were available (i.e., when using the system that included all three surrogates). In these analyses, as in the analysis of this outcome variable above, the 28 cases where judgments were based solely on the textual metadata were excluded.
First, age was found to be related to the choice of which surrogate(s) to view when multiple were available (chi-square(4) = 11.134, p = 0.025). As in previous analyses, the surrogates viewed when using the system that included all the possibilities
were taken to mean that the user preferred those surrogates. For this analysis, the participants were divided, by age, into three groups
that were roughly equal in size: 18–21 years old, 22–29, and 30–58. The results of the analysis are shown in Table 12. An examination
of the standardized residuals indicated that the oldest age group selected the storyboard surrogate more often than expected.
Second, the frequency of video viewing was related to the surrogate selected when multiple surrogates were available (chi-square
(6) = 22.745, p = 0.001). These results are shown in Table 13. An examination of the standardized residuals indicated that those
watching videos only occasionally chose to use the fast forward surrogate more often than expected; those watching videos monthly
chose to use the 7-second segment more often than expected.

Table 12
Effects of age on surrogate preference/use.

                     Age groups
Surrogate            18–21 (n = 16)   22–29 (n = 12)   30–58 (n = 8)   Total

Storyboard           79               57               51              187
Fast forward         107              85               40              232
7-second segment     109              78               33              220
Total                295              220              124             639

Note: The n's in the column headings indicate the number of participants in each age group. The number in each cell represents the number of times study participants in a particular age group selected that surrogate while using the system that provided all three surrogates.


Table 13
Effects of video viewing frequency on surrogate preference/use.

                     Frequency of video viewing
Surrogate            Occasionally (n = 3)   Monthly (n = 6)   Weekly (n = 23)   Daily (n = 4)   Total

Storyboard           9                      11                142               25              187
Fast forward         26                     18                153               35              232
7-second segment     11                     35                141               33              220
Total                46                     64                436               93              639

Note: The n's in the column headings indicate the number of participants in each group. The number in each cell represents the number of times study participants in a particular group selected that surrogate while using the system that provided all three surrogates.
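The chi-square statistics reported in this section can be recomputed directly from the cell counts in Tables 12 and 13. The following sketch (plain Python; the function name and data layout are ours, not the original analysis scripts) computes the Pearson chi-square statistic and the standardized residuals, (observed − expected)/√expected, that were examined to identify the cells driving each effect:

```python
import math

def chi_square(table):
    """Pearson chi-square, degrees of freedom, and standardized residuals
    for a two-way contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            res_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(res_row)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, residuals

# Table 12: surrogate (rows) x age group (columns: 18-21, 22-29, 30-58)
age = [[79, 57, 51],      # storyboard
       [107, 85, 40],     # fast forward
       [109, 78, 33]]     # 7-second segment
chi2, df, res = chi_square(age)
print(round(chi2, 3), df)       # 11.134 4
print(round(res[0][2], 2))      # 2.44: 30-58 group chose the storyboard more than expected

# Table 13: surrogate (rows) x viewing frequency (columns: occasionally ... daily)
freq = [[9, 11, 142, 25],
        [26, 18, 153, 35],
        [11, 35, 141, 33]]
chi2, df, res = chi_square(freq)
print(round(chi2, 3), df)       # 22.745 6
```

Standardized residuals beyond roughly ±1.96 flag cells that contribute disproportionately to the overall statistic, which matches the interpretation given in the text for the oldest age group, the occasional viewers, and the monthly viewers.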

7. Discussion

The purpose of this study was to investigate the role of various surrogates in the process of making relevance judgments about
videos in a digital video library. In particular, we were concerned with evaluating whether particular surrogates were more or less
capable of supporting users in making accurate relevance judgments, defined as matching the judgment they would make when
viewing the entire video. One would expect that the ideal surrogate would balance the need to provide as much information as possible with the need to keep users’ interactions with it as efficient and satisfying as possible.
In this study, three surrogates were evaluated. The storyboard provides a small number of still images selected from the main
scenes in the video, thus providing a very compact summary of the full video. The 7-second segment, selected from the beginning of
the video, provides both audio and video of that small piece of the full video. The fast forward surrogate provides a view of a large
number of images from the video, played in a way that simulates the action of the video; no audio is provided. These surrogates were
compared with each other and with a system that incorporated all three. To evaluate which of these surrogates might be most useful
in supporting users’ relevance judgments, the study design attempted to balance experimental control with the need for ecological
validity. All participants completed the same tasks (control exerted) but were allowed to interact with the surrogates freely (naturalistic). The results are summarized and discussed here.
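The three surrogates differ chiefly in how densely they sample the underlying video. The study used the surrogates as generated for the Open Video collection and does not report their construction; as a rough sketch of the sampling logic only (the function names, even-spacing strategy, and 64x speedup are illustrative assumptions — the actual storyboards drew keyframes from the main scenes rather than at even intervals):

```python
def storyboard_frames(total_frames, n_keyframes=6):
    """Pick a handful of evenly spaced frame indices (a stand-in for
    scene-based keyframe selection) to form a compact storyboard."""
    step = total_frames / n_keyframes
    return [int(step * i + step / 2) for i in range(n_keyframes)]

def fast_forward_frames(total_frames, speedup=64):
    """Keep every `speedup`-th frame, so playback at the normal frame
    rate simulates the video's motion at `speedup`x speed (no audio)."""
    return list(range(0, total_frames, speedup))

# A 5-minute video at 30 fps has 9000 frames.
print(len(storyboard_frames(9000)))    # 6 stills
print(len(fast_forward_frames(9000)))  # 141 frames, about 4.7 s of playback at 30 fps
```

The contrast in sample density (a few stills versus roughly 140 frames in motion) mirrors the trade-off discussed below between the storyboard's compactness and the fast forward's simulated action.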

7.1. Accuracy of initial relevance judgments

Based on an analysis of 275 pairs of initial and final relevance judgments, we found that all three surrogates supported users in
making accurate relevance judgments equally well (i.e., no statistically significant differences between surrogates were found). Thirty
percent of the initial judgments matched the final judgments; 57% moved up or down one point on the relevance scale; and 13%
moved two points. A primary reason for this lack of differences is likely the amount of design effort that went into these surrogates, all
of which had been evaluated in previous studies (Hughes et al., 2003; Wildemuth et al., 2002, 2003). Thus, this finding validates the
development and evaluation methods used, since all the surrogates were found to be effective in the current study's relatively
naturalistic context. Another methodological consideration is that textual metadata and a poster frame were available to users as they
viewed each of the surrogates; in other words, each of the surrogates included a poster frame and textual information such as the
video title, keywords, year of production, and run time. Users’ comments on their use of the textual metadata when evaluating the
relevance of the videos support this explanation. The decision to include these metadata in the viewable records was intended to
make the experimental context more realistic (i.e., ecologically valid), but it may have also decreased the differences between the
surrogates. Future research might compare a baseline interface, providing just the poster frame and textual metadata, with interfaces
augmented with specific novel surrogates.

7.2. Time required to make initial relevance judgments

Overall, the mean time required to use one (or multiple) of the surrogates was just over 16 s. The surrogates did vary in the
amount of time required for these interactions, with text and poster frame being quickest (5–6 s), followed by the storyboard (11 s),
then the fast forward and 7-second segment surrogates (15–17 s), with multiple surrogates taking the longest (31–33 s). While
differences in the amount of time taken to interact with the different surrogates could be detected, these differences did not affect the
accuracy of the initial relevance judgments. Thus, we might conclude that providing only textual metadata and a poster frame or, at
most, a storyboard to support user interactions with digital video libraries would be the most efficient use of their time. However,
based on user comments, ratings of their perceptions, and the choices they made in interacting with the surrogates, the richer
surrogates should also be considered for inclusion in the interface of a digital video library. It is noteworthy that, even though
participants understood that using multiple surrogates would take longer, they still chose to view multiple surrogates when they had
the opportunity to do so and almost half of them selected the multiple-surrogate system as the one they preferred. If user preferences
are taken into account, multiple surrogates should be available.
The results indicate that the time spent interacting with surrogates also varied with the purpose of the assigned task: either
production (developing an online tutorial or creating a montage) or viewing (showing video clips to a class). The relevance judgments
associated with the two viewing tasks took slightly longer than those associated with the production tasks (17 versus 15 s). The
reasons for this finding warrant further investigation.

7.3. User perceptions of the surrogates

Participants’ ratings of the perceived usefulness and perceived ease of use of the different surrogates revealed no differences
between them. In addition, when users were able to choose which surrogate(s) to use for a particular task (i.e., when using the system
that included all three surrogates), they chose each at comparable levels. Thus, we can conclude that all the surrogates were perceived as useful and easy to use. Similarly, participants judged their enjoyment of and concentration when using the different
surrogates the same across all three variations and the combination. On these scales, the ratings were moderately positive; given that
the system was new to all the participants, their lack of experience with it likely impeded a more positive flow experience.
When asked to make direct comparisons of the different surrogates, the participants were able to distinguish them. While half of
them reported no difference in the ease of learning to use the various surrogates, most of the others found the system version
providing multiple surrogates to be easiest to learn. Almost half the participants also found the multiple-surrogate system easiest to
use and best overall. While these positive assessments of the seemingly most complex system are somewhat surprising, this finding is
consistent with that of Hürst and Dos Santos Carvalhal (2017), in which users of a rich visual browsing interface still desired
additional information about the TV shows included in the collection. Our participants’ open-ended responses to the post-session
questionnaire also corroborated their ratings; they described the multiple-surrogate system as providing more “visual and verbal
clues” (P10) and “multiple ways to… judge whether the video is relevant” (P31). They also appreciated the opportunity to select a
particular surrogate to use for a particular task. Their comments strongly suggest that digital video library interfaces should provide
multiple surrogates for each video.
The one individual surrogate that received the most positive expressions of user opinions was the fast forward. Almost half the
participants judged it the best overall. Its combination of providing an overview of the whole video and incorporating motion seemed
to please the participants; as one noted, “it seemed to combine the best elements of the other two systems” (P15). The one individual
surrogate that received a number of negative comments was the 7-second segment. While the participants appreciated the fact that it
incorporated the audio track from the segment, many expressed disappointment in the fidelity with which it represented the video.
They were unsure that the segment selected was the best representation of the full video and some described it as misleading or
deceptive. They also experienced a lack of control over which brief segment to view, since the system provided only one option. These
negative perceptions can lead us to conclude that this particular surrogate has been adequately replaced by the scrubbable timelines
available in current online video collections.

7.4. Methodological issues

As noted earlier, the current study attempted to balance the need to observe “natural” behaviors and the need to control the study
variables. We tried to achieve this balance in several ways. Only participants with prior experience with digital video were recruited,
so that the novelty of working with this medium would not affect the results; the participants can be viewed as “realistic” users of a
digital video library. The search system interface was the same as the “live” Open Video website interface, optimizing its naturalness;
however, the collection was truncated to include only relatively short (i.e., less than 10 min) videos that had all three types of
surrogates available. Limiting the collection to short videos minimized the burden on the study participants, who were asked to view
eight full videos; limiting the collection to those with all the surrogates exerted experimental control. In addition, the generalizability
of the study findings is limited to videos of a particular genre: documentaries and educational programs such as those included in the
Open Video repository. While it might be argued that the findings could apply to similar genres, it is less likely that they could apply to
very dissimilar genres like surveillance or lifelogging videos. The assigned tasks were written as simulated work task scenarios, with
the goal of making them naturalistic, as suggested by Borlund (2003). Even so, these four tasks represent only a small sample of the many task types that might be supported by a video collection. Thus, the generalizability of the results
is limited to these tasks. While we still believe that we achieved an appropriate balance between experimental control and ecological
validity, other researchers might have made other decisions about the study design.
One of the decisions made in designing the study was related to the selection of the videos to be used as the basis for the final
relevance judgments to be made by each participant. Two videos were selected for each task for each participant, so each participant
viewed eight full videos and made relevance judgments based on that viewing. The two videos were selected from those for which the
participant had made an initial relevance judgment based on viewing one or more surrogates; each participant was asked to make ten
initial relevance judgments for each task, so two of the ten were selected for further viewing. We took a conservative approach to the
selection of these videos. In each case, videos were selected from those given a Possibly Relevant or Possibly Not Relevant initial
judgment, with the goal of selecting videos for which the participant's relevance judgment was most likely to shift; this strategy
avoided the inflation of the measure of relevance judgment accuracy. However, it may also have contributed to the lack of statistically significant effects across surrogates by making the overall performance of all the surrogates appear less effective.
Individual differences in the process of making relevance judgments may also have affected the validity of our findings. The
results indicated that some participants were more likely to be accurate in their judgments (i.e., their final judgment based on viewing
the full video matched their initial judgment based on viewing only surrogates). This pattern may be due to differences in people's
abilities to make accurate relevance judgments from fewer cues, or it may be due to some people being better able to remember the
initial relevance judgment and repeating it. This problem is likely to occur for any study investigating similar research questions, and
so may warrant further investigation by applying think-aloud protocols or similar methods during the judgment process.


Finally, as noted earlier, all the surrogates were viewed in combination with a set of textual metadata and a poster frame, and this
may have caused us to underestimate the differences between surrogates. The decision to maintain the metadata in the viewable
records was intended to keep the search context naturalistic. Of the 1457 initial relevance judgments made across all four versions of
the system, 116 (8%) were based on text and poster frame only. To address this problem, most analyses excluded these cases; but
readers should also keep in mind that each surrogate was really the surrogate plus a poster frame and textual metadata.

8. Conclusion

The current study is unique in examining the role of various surrogates in people's interactions with a collection of digital videos.
It evaluates several different surrogates in terms of user performance as aided by those surrogates, as well as users’ affective responses
to their interactions with the surrogates. The study was motivated by both practical goals (i.e., to suggest ways to optimize the design
of surrogates to be included in large digital video collections) and theoretical goals (i.e., to better understand the ways in which
different surrogates might support people in making accurate relevance judgments). As Marchionini et al. (2006) pointed out, these
two types of goals form a Möbius strip of research and practice. In this section, the ways in which this study's findings address each of
these two sets of goals will be discussed.

8.1. Implications for design of video surrogates

The results of this study clearly suggest that a digital video library should provide multiple surrogates for the videos it holds.
Textual metadata continues to be important, even though the items in the collection are not textual in their form. Users will interact
with textual metadata in order to understand the gist of a video's content and in order to evaluate their search terms and get ideas for
additional search terms. Other studies have also found that users expect textual metadata to be part of any high-quality digital
video collection (Albertson & Ju, 2015) and that they find it useful in making relevance judgments (Cunningham & Nichols, 2008).
A poster frame provides users with an initial sense of the look of a video, and has been incorporated in almost all current video
collections. In the current study, the poster frame and textual metadata were available along with all the other surrogates evaluated. In future evaluative studies of surrogates, the baseline of a poster frame and textual metadata should be treated as a
control and evaluated in comparison with the extra value added by other surrogates.
The results of the current study indicate that a storyboard surrogate should be included as a necessary augmentation of a single
poster frame (though very few current collections include this surrogate). It provides users with an image-based and structured
overview of the entire video. However, its static nature does nothing to support users’ knowledge of the tone or feel of the video and
so it is not sufficient. The 7-second segment is also not available in most current collections, but it can be argued that the current
generation of streaming video players supports an equivalent functionality. The findings from the current study suggest that the
storyboard and the 7-second segments can be fused, so that the images in the storyboard can be used to launch short segments of any
of the scenes in the video (e.g., as in Jackson et al., 2013; Westman, 2010). This fusion of the two surrogates would allow users to
examine specific sections of the video, including the audio track in those sections.
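A minimal sketch of such a fusion: each storyboard keyframe anchors a short playable window into the full video, audio included. The function, the 7-second window length carried over from the segment surrogate, and the example timestamps below are illustrative assumptions, not a description of an implemented system:

```python
def playable_storyboard(keyframe_times, duration, clip_len=7.0):
    """Map each storyboard keyframe timestamp (seconds) to a (start, end)
    window so clicking the image plays a short segment with audio."""
    clips = []
    for t in keyframe_times:
        start = max(0.0, t - clip_len / 2)    # center the clip on the keyframe
        end = min(duration, start + clip_len)  # clamp to the video's duration
        clips.append((start, end))
    return clips

# Keyframes (seconds) drawn from the main scenes of a 300-second video.
print(playable_storyboard([2.0, 95.0, 210.0, 298.0], duration=300.0))
# [(0.0, 7.0), (91.5, 98.5), (206.5, 213.5), (294.5, 300.0)]
```

Centering each window on its keyframe keeps the segment representative of the scene the still image came from, addressing the participants' complaint that a single fixed segment could be misleading.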
The fast forward surrogate was highly regarded by the participants in this study, so it should also be included. Many participants
found it a very efficient way to get an overview of the video. While the fast forward surrogate lacks the audio track, it does simulate
the movement of the full video. Varying the speed of the playback based on video characteristics (Cheng et al., 2009; Peker &
Divakaran, 2003) or supporting user control of the playback speed (Westman, 2010) are avenues that may be further evaluated for
their effectiveness with users. Current systems do not provide this functionality.
In many current online video collections, a scrubbable timeline is provided, allowing a user to view any length of segment from
any point in the video. This type of surrogate was not evaluated in the current study, but its value might be anticipated based on our
findings. It does not provide the structured overview of a video that is provided by a storyboard, but it does provide access to short
segments of the video, so the user could sample a number of short segments during the relevance decision making process. While
quite useful in providing access to many videos, it is less useful for longer videos with richer content, since scalability is an important
issue in this context. A direct comparison of a scrubbable timeline and a playable storyboard in the context of documentary or
narrative videos would improve our understanding of the relative merits of each of these surrogates.
None of these surrogates, on its own, is sufficient to fully support users’ decisions about the relevance of a particular video. Most
current systems provide a poster frame, textual metadata, and the ability to scrub a timeline of the video and/or search a transcript
for relevant portions of the video. However, they do not provide storyboards, skims, or fast forwards, so there are still many opportunities for significant improvement in supporting users’ relevance judgments. It is important to provide a broad range of surrogates, since users employ different surrogates for different purposes and to address different tasks.
The data on surrogate efficiency suggest that providing textual metadata plus a storyboard may be sufficient and would achieve
the goal of supporting users in making accurate relevance judgments very quickly. However, the data on user perceptions of their
interactions with the surrogates clearly indicate that efficiency should not be our only design goal. It must be augmented with goals
related to the user experience. Based on the evidence provided by this study, users are very willing to sacrifice some efficiency in
order to be more confident in their understanding of the video under consideration and their selection of those videos most likely to
be relevant to their needs.


8.2. Implications for understanding the role of surrogates in user interactions

The focus of this study was on supporting people while they are making relevance judgments. From their comments during the
study, it is clear that the participants use different aspects of the surrogates in different ways to inform these judgments. For example,
one participant (P10) said, “I liked to use the fast forward feature when I was looking for action. I liked the seven second feature when
I was listening for dialogue related to the query.” It is likely that their preference for using the system that provided multiple
surrogates is a reflection of their nuanced (though possibly subconscious) understanding of their own reasoning while making relevance judgments.
In addition, our understanding of how people use surrogates in making relevance judgments is limited to the context of the types
of tasks included in this study. The assigned tasks varied in their concreteness and their purpose, but there are many other kinds of
tasks that might be supported through use of a digital video collection. To fully understand the roles that surrogates might play in
supporting people's relevance judgments, future studies should consider incorporating different task types.
Another type of data that is included in many current online video collections is social data, such as reviews and ratings from
other viewers/users. These data are not surrogates, so were not included in this study. However, it seems likely that they do play a
role in people's judgments about the relevance of particular videos for particular purposes. Thus, an investigation of the role of these
social data in people's judgments and the interaction between these data and the surrogates available is warranted.
As discussed in the literature review, surrogates also play a role in people's sense making. While not the focus of the current study,
the results do suggest that surrogates may play additional roles in sense making. For example, we did not ask the study participants to
actually create a montage of video clips to address the production tasks; if we had, it is likely that the various surrogates would have
supported them in different ways to complete this task. Gaining an increased understanding of a video from its surrogates can positively influence a person's fuller use of that video. The role of surrogates, and differences across surrogates, in relation
to sense making is an area of future research that would be very fruitful.
Online video collections have continued to grow since these data were collected and are much larger in scope than they were a few years ago. Even so, the provision of useful and usable surrogates for interacting with these collections has changed very little. To
undergird progress in developing and implementing more useful video surrogates, surrogate theory should continue to be an area of
research interest. By gaining a deeper understanding of the roles that surrogates play in people's interactions with information
objects, we can look forward to building more effective and satisfying digital video collections.

Declaration of Competing Interest

The authors have no competing interest to declare.

Acknowledgments

Additional assistance with the data collection and analysis was provided by Thomas Tolleson and Jie Luo. This work was
supported by National Science Foundation (NSF) Grant IIS 0099638.

References

Adcock, J., Cooper, M., Girgensohn, A., & Wilcox, L. (2005). Interactive video search using multilevel indexing. In W.-K. Leow, M. S. Lew, T.-S. Chua, W.-Y. Ma, L.
Chaisorn, & E. M. Bakker (Eds.). Proceedings of the 4th international conference on image and video retrieval (CIVR 2005) (pp. 205–214). . https://doi.org/10.1007/
11526346_24.
Al Maqbali, H., Scholer, F., Thom, J. A., & Wu, M. (2010). Evaluating the effectiveness of visual summaries for web search. 15th Australasian document computing
symposium (ADCS 2010) (pp. 36–43). . Retrieved December 1, 2017, from http://www.cs.rmit.edu.au/adcs2010/proceedings/pdf/adcs2010proceedings.pdf#
page=43.
Albertson, D. (2013). An interaction and interface design framework for video digital libraries. Journal of Documentation, 69(5), 667–692. https://doi.org/10.1108/JD-
12-2011-0056.
Albertson, D. (2016). A unified framework of information needs and perceived barriers in interactive video retrieval. Journal of Information Science Theory and Practice,
4(4), 4–15. https://doi.org/10.1633/JISTaP.2016.4.4.1.
Albertson, D., & Johnston, M. P. (2017). Not reinventing the “Reel:” Adaptation and evaluation of an existing model for digital video information seeking. Proceedings
of the annual meeting of the Association for Information Science and Technology, 54, 10–17. https://doi.org/10.1002/pra2.2017.14505401002.
Albertson, D., & Ju, B. (2015). Design criteria for video digital libraries: Categories of important features emerging from users' responses. Online Information Review,
39(2), 214–228. https://doi.org/10.1108/OIR-10-2014-0251.
Al-Hajri, A., Miller, G., Fels, S., & Fong, M. (2013). Video navigation with a personal viewing history. In P. Kotzé, G. Marsden, G. Lindgaard, J. Wesson, & M. Winckler
(Eds.). Human-computer interaction – INTERACT 2013: 14th IFIP TC 13 international conference, proceedings, part III (pp. 352–369). Springer. https://doi.org/10.
1007/978-3-642-40477-1_22.
Amir, A., Ponceleon, D., Blanchard, B., Petkovic, D., Srinivasan, S., & Cohen, G. (2000). Using audio time scale modification for video browsing. Proceedings of the 33rd
Hawaii international conference on system sciences, HICSS-2000 (Maui, HI, January, 2000) (pp. 1–10). . https://doi.org/10.1109/HICSS.2000.926728.
Aula, A., Khan, R. M., Guan, Z., Fontes, P., & Hong, P. (2010). A comparison of visual and textual page previews in judging the helpfulness of web pages. Proceedings of
the 19th international conference on world wide web (WWW’10) (pp. 51–59). . https://doi.org/10.1145/1772690.1772697.
Awad, G., Butt, A. A., Fiscus, J., Joy, D., Delgado, A., McClinton, W., et al. (2018). TRECVID 2017: Evaluating ad-hoc and instance video search, events detection, video
captioning, and hyperlinking. Retrieved April 25, 2019, from https://www-nlpir.nist.gov/projects/tvpubs/tv17.papers/tv17overview.pdf.
Awad, G., Butt, A. A., Fiscus, J., Joy, D., Delgado, A., Michel, M., et al. (2017, November 9). TRECVID 2017: Evaluating ad-hoc and instance video search, events
detection, video captioning, and hyperlinking. TRECVID overview paper. Retrieved December 19, 2017, from http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.17.org.
html.
Bärecke, T., Kijak, E., Nürnberger, A., & Detyniecki, M. (2006). VideoSOM: a SOM-based interface for video browsing. In H. Sundaram, M. Naphade, J. R. Smith, & Y.
Rui (Eds.). Proceedings of the 5th international conference on image and video retrieval (CIVR 2006) (pp. 506–509). . https://doi.org/10.1007/11788034_55.
Barry, C. L. (1998). Document representations and clues to document relevance. Journal of the American Society for Information Science, 49(14), 1293–1303. https://doi.
org/10.1002/(SICI)1097-4571(1998)49:14%3C1293::AID-ASI7%3E3.0.CO;2-E.
Benini, S., Migliorati, P., & Leonardi, R. (2010). Statistical skimming of feature films. International Journal of Digital Multimedia Broadcasting, 2010, 709161. https://doi.
org/10.1155/2010/709161.
Bloomberg. (2016). Number of daily mobile video views on Snapchat as of April 2016 (in billions). Statista - The Statistics Portal. Retrieved October 17, 2017, from https://
www.statista.com/statistics/513494/snapchat-daily-video-views/.
Borko, H., & Bernier, C. (1975). Abstracting concepts and methods. New York: Academic Press.
Borlund, P. (2003). The IIR evaluation model: A framework for evaluation of interactive information retrieval systems. Information Research, 8(3), 152. Retrieved
November 30, 2018, from http://www.informationr.net/ir/8-3/paper152.html.
Buckland, M. K. (1997). What is a “document”? Journal of the American Society for Information Science, 48(9), 804–809. https://doi.org/10.1002/(SICI)1097-
4571(199709)48:9<804::AID-ASI5>3.0.CO;2-V.
Burke, M. (1999). Organization of multimedia resources. Hampshire, UK: Gower Publishing.
Capra, R., Arguello, J., & Scholer, F. (2013). Augmenting web search surrogates with images. Proceedings of the 22nd ACM international conference on information &
knowledge management (CIKM ’13) (pp. 399–408). . https://doi.org/10.1145/2505515.2505714.
Chen, Y., Wang, J., Liu, J., & Lu, H. (2015). Mobile media thumbnailing. Proceedings of the 5th ACM on international conference on multimedia retrieval (ICMR ’15) (pp.
665–666). . https://doi.org/10.1145/2671188.2749409.
Cheng, K.-Y., Luo, S.-J., Chen, B.-Y., & Chu, H.-H. (2009). SmartPlayer: User-centric video fast-forwarding. Proceedings of the SIGCHI conference on human factors in
computing systems (CHI ’09) (pp. 789–798). . https://doi.org/10.1145/1518701.1518823.
Christel, M., & Warmack, A. (2001). The effect of text in storyboards for video navigation. Proceedings of the 2001 IEEE international conference on acoustics, speech, and
signal processing (ICASSP '01) (pp. 1409–1412). . https://doi.org/10.1109/ICASSP.2001.941193.
Christel, M. G. (2006). Evaluation and user studies with respect to video summarization and browsing. Multimedia content analysis, management and retrieval, proceedings
of SPIE (pp. 196–210). . https://doi.org/10.1117/12.642841.
Christel, M. G. (2008). Amplifying video information-seeking success through rich, exploratory interfaces. In G. A. Tsihrintzis, M. Virvou, R. J. Howlett, & L. C. Jain
(Vol. Eds.), Studies in computational intelligence: . 142. New directions in intelligent interactive multimedia (pp. 21–30). Springer. https://doi.org/10.1007/978-3-540-
68127-4_2.
Christel, M. G., Smith, M. A., Taylor, C. R., & Winkler, D. B. (1998). Evolving video skims into useful multimedia abstractions. Proceedings of the ACM SIGCHI conference
on human factors in computing systems (CHI ’98) (pp. 171–178). . https://doi.org/10.1145/274644.274670.
Church, K., Smyth, B., & Keane, M. (2006). Evaluating interfaces for intelligent mobile search. Proceedings of the 2006 international cross-disciplinary workshop on web
accessibility (W4A): Building the mobile web: rediscovering accessibility? (pp. 69–78). . https://doi.org/10.1145/1133219.1133232.
Cobârzan, C., Schoeffmann, K., Bailer, W., Hürst, W., Blažek, A., Lokoč, J., et al. (2017). Interactive video search tools: A detailed analysis of the video browser
showdown 2015. Multimedia Tools & Applications, 76, 5539–5571. https://doi.org/10.1007/s11042-016-3661-2.
Cunningham, S., & Nichols, D. (2008). How people find videos. Proceedings of the joint conference on digital libraries (JCDL ’08) (pp. 201–210). . https://doi.org/10.
1145/1378889.1378924.
Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly, 13(3), 319–340. https://doi.org/10.
2307/249008.
Del Fabro, M., Münzer, B., & Böszörmenyi, L. (2013). Smart video browsing with augmented navigation bars. In S. Li, A. El Saddik, M. Wang, T. Mei, N. Sebe, S. Yan, R.
Hong, & C. Gurrin (Vol. Eds.), Lecture notes in computer science: . vol. 7733. Advances in multimedia modeling. Proceedings, part II, of the 19th international conference,
MMM 2013 (pp. 88–98). Springer. https://doi.org/10.1007/978-3-642-35728-2.
Dervin, B., & Frenette, M. (2001). Sense-making methodology: Communicating communicatively with campaign audiences. In R. E. Rice, & C. K. Adkin (Eds.). Public
communication campaigns (pp. 69–87). Thousand Oak, CA: Sage.
Dimitrova, N. (2003). Multimedia content analysis: The next wave. Lecture notes in computer science, vol. 2728. https://doi.org/10.1007/3-540-45113-7_2.
Ding, W., Marchionini, G., & Tse, T. (1997). Previewing video data: Browsing key frames at high rates using a video slide show interface. Proceedings of the international
symposium on research, development and practice in digital libraries (pp. 151–158). . Retrieved October 17, 2017, from http://www.dl.slis.tsukuba.ac.jp/ISDL97/
proceedings/weid/weid.html.
Ding, W., Soergel, D., & Marchionini, G. (1999). Performance of visual, verbal, and combined video surrogates. Proceedings of the annual meeting of the American Society
for Information Science (ASIS ‘99) (pp. 651–664). .
Divakaran, A., Forlines, C., Lanning, T., Shipman, S., & Wittenburg, K. (2005). Augmenting fast-forward and rewind for personal digital video recorders. 2005 digest of
papers: International conference on consumer electronics (pp. 43–44). . https://doi.org/10.1109/ICCE.2005.1429708.
Drucker, S., Glatzer, A., DeMar, S., & Wong, C. (2002). SmartSkip: Consumer level browsing and skipping of digital video content. Proceedings of the ACM SIGCHI
conference on human factors in computing systems (pp. 219–226). . https://doi.org/10.1145/503376.503416.
Dziadosz, S., & Chandrasekar, R. (2002). Do thumbnail previews help users make better relevance decisions about web search results? Proceedings of the 25th annual
international ACM SIGIR conference on research and development in information retrieval (pp. 365–366). . https://doi.org/10.1145/564376.564446.
eMarketer. (2018). Number of digital video viewers in the United States from 2012 to 2021 (in millions). Statista - The Statistics Portal. Retrieved October 17, 2017, from
https://www.statista.com/statistics/271611/digital-video-viewers-in-the-united-states/.
Enser, P. (2000). Visual image retrieval: Seeking the alliance of concept-based and content-based paradigms. Journal of Information Science, 26(4), 199–210. https://
doi.org/10.1177/016555150002600401.
Fan, J., Elmagarmid, A. K., Zhu, X., Aref, W. G., & Wu, L. (2004). ClassView: Hierarchical video shot classification, indexing, and accessing. IEEE Transactions on
Multimedia, 6(1), 70–86. https://doi.org/10.1109/TMM.2003.819583.
Follett, A. (2015). 18 big video marketing statistics: What they mean for you. Video Brewery (blog). Retrieved October 17, 2017, from http://www.videobrewery.com/
blog/18-video-marketing-statistics.
Frøkjær, E., Hertzum, M., & Hornbæk, K. (2000). Measuring usability: Are effectiveness, efficiency, and satisfaction really correlated? Proceedings of the ACM SIGCHI
Conference on Human Factors in Computing Systems (pp. 345–352). . https://doi.org/10.1145/332040.332455.
Furini, M., Geraci, F., Montangero, M., & Pellegrini, M. (2010). STIMO: STIll and MOving video storyboard for the web scenario. Multimedia Tools and Applications,
46(1), 47–69. https://doi.org/10.1007/s11042-009-0307-7.
Geisler, G., Marchionini, G., Nelson, M., Spinks, R., & Yang, M. (2001). Interface concepts for the open video project. Proceedings of the annual meeting of the American Society for Information Science & Technology, vol. 38 (pp. 58–75). Retrieved November 19, 2018, from http://ggeisler.com/publications/asist01_geisler.pdf.
Ghani, J. A., Supnick, R., & Rooney, P. (1991). The experience of flow in computer-mediated and in face-to-face groups. Proceedings of the 12th international conference
on information systems (pp. 229–237). . Retrieved October 27, 2017, from http://aisel.aisnet.org/icis1991/9.
Girgensohn, A., Shipman, F., & Wilcox, L. (2011). Adaptive clustering and interactive visualizations to support the selection of video clips. Proceedings of the 1st ACM international conference on multimedia retrieval (ICMR ’11), article 34. https://doi.org/10.1145/1991996.1992030.
Goeau, H., Thièvre, J., Viaud, M.-L., & Pellerin, D. (2007). Interactive visualization tool with graphic table of video contents. Proceedings of the IEEE international
conference on multimedia and expo (pp. 807–810). . https://doi.org/10.1109/ICME.2007.4284773.
Goodrum, A. (1997). Evaluation of text-based and image-based representations for moving image documents. PhD dissertation, University of North Texas.
Goodrum, A. A. (2001). Multidimensional scaling of video surrogates. Journal of the American Society for Information Science, 52(2), 174–182. https://doi.org/10.1002/
1097-4571(2000)9999:9999<::AID-ASI1580>3.0.CO;2-V.
Google. (2016). Most common motivations for watching online videos in the United States in 2015. Statista - The Statistics Portal. Retrieved October 17, 2017, from https://
www.statista.com/statistics/354190/motivations-watching-online-videos-us/.
Haesen, M., Meskens, J., Luyten, K., Coninx, K., Becker, J. H., Tuytelaars, T., et al. (2013). Finding a needle in a haystack: An interactive video archive explorer for
professional video searchers. Multimedia Tools and Applications, 63(2), 331–356. https://doi.org/10.1007/s11042-011-0809-y.
Harter, S. P. (1992). Psychological relevance and information science. Journal of the American Society for Information Science, 43(9), 602–615. https://doi.org/10.1002/
(SICI)1097-4571(199210)43:9%3C602::AID-ASI3%3E3.0.CO;2-Q.
Hauptmann, A. G., et al. (2005). Lessons for the future from a decade of Informedia video analysis research. In W.-K. Leow, (Vol. Ed.), Lecture notes in computer science: .
vol. 3568. Image and video retrieval: 4th International conference, CIVR 2005 (Singapore, July 2005) proceedings (pp. 1–10). Springer. https://doi.org/10.1007/
11526346_1.
He, L., Sanocki, E., Gupta, A., & Grudin, J. (1999). Auto-summarization of audio-video presentations. Proceedings of the 7th ACM international conference on multimedia
(part 1), MULTIMEDIA ’99 (pp. 489–498). . https://doi.org/10.1145/319463.319691.
Hearst, M. A. (2009). Search user interfaces. Cambridge University Press.
Herranz, L., & Jiang, S. (2016). Scalable storyboards in handheld devices: Applications and evaluation metrics. Multimedia Tools and Applications, 75(20),
12597–12625. https://doi.org/10.1007/s11042-014-2421-4.
Higuchi, K., Yonetani, R., & Sato, Y. (2017). EgoScanning: Quickly scanning first-person videos with egocentric elastic timelines. Proceedings of the 2017 CHI conference
on human factors in computing systems (CHI ’17) (pp. 6535–6546). . https://doi.org/10.1145/3025453.3025821.
Hu, W., Xie, N., Li, L., Zeng, X., & Maybank, S. (2011). A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics
– Part C: Applications and Reviews, 41(6), 797–819. https://doi.org/10.1109/TSMCC.2011.2109710.
Hughes, A., Wilkens, T., Wildemuth, B., Marchionini, G., et al. (2003). Text or pictures? An eyetracking study of how people view digital video surrogates. In E. M.
Bakker, (Vol. Ed.), Lecture notes in computer science: . vol. 2728. Image and video retrieval: Second international conference, CIVR 2003, proceedings (pp. 271–280).
Springer. https://doi.org/10.1007/3-540-45113-7_27.
Hürst, W. (2006). Interactive audio-visual video browsing. Proceedings of the 14th annual ACM international conference on multimedia (pp. 675–678). . https://doi.org/
10.1145/1180639.1180781.
Hürst, W., Ching, A. I. V., Schoeffmann, K., & Primus, M. J. (2017). Storyboard based video browsing using color and concept indices. Proceedings of the international
conference on multimedia modeling (pp. 480–485). . https://doi.org/10.1007/978-3-319-51814-5_45.
Hürst, W., & Dos Santos Carvalhal, B. (2017). Exploring online video databases by visual navigation. TVX 2017 – Adjunct publication of the 2017 ACM international
conference on interactive experiences for TV and online video (pp. 39–44). . https://doi.org/10.1145/3084289.3089921.
Hürst, W., Götz, G., & Jarvers, P. (2004). Advanced user interfaces for dynamic video browsing. Proceedings of the 12th annual ACM international conference on
multimedia (MULTIMEDIA ’04) (pp. 742–743). . https://doi.org/10.1145/1027527.1027694.
Hürst, W., & Klappe, G. (2017). Design parameters for storyboard-based mobile video browsing interfaces. IEEE international conference on multimedia and expo
workshops (ICMEW) (pp. 405–410). . https://doi.org/10.1109/ICMEW.2017.8026265.
Jackson, D., Nicholson, J., Stoeckigt, G., Wrobel, R., Thieme, A., & Olivier, P. (2013). Panopticon: A parallel video overview system. Proceedings of the 26th annual ACM
symposium on user interface software and technology (UIST ’13) (pp. 123–130). . https://doi.org/10.1145/2501988.2502038.
Jacob, H., Pádua, F. L. C., Lacerda, A., & Pereira, A. C. M. (2017). A video summarization approach based on the emulation of bottom-up mechanisms of visual
attention. Journal of Intelligent Information Systems, 49(2), 193–211. https://doi.org/10.1007/s10844-016-0441-4.
Janes, J. W. (1991). Relevance judgments and the incremental presentation of document representations. Information Processing & Management, 27(6), 629–646.
https://doi.org/10.1016/0306-4573(91)90004-6.
Jansen, M., Heeren, W., & van Dijk, B. (2008). VideoTrees: Improving video surrogate presentation using hierarchy. Proceedings of the international workshop on content-
based multimedia indexing (CBMI 2008) (pp. 560–567). . https://doi.org/10.1109/CBMI.2008.4564997.
Jiao, B., Yang, L., Xu, J., & Wu, F. (2010). Visual summarization of web pages. Proceedings of the 33rd international ACM SIGIR conference on research and development in
information retrieval (pp. 499–506). . https://doi.org/10.1145/1835449.1835533.
Jones, S., Jones, M., & Deo, S. (2004). Using keyphrases as search result surrogates on small screen devices. Personal Ubiquitous Computing, 8, 55–68. https://doi.org/
10.1007/s00779-004-0258-y.
Jorgensen, C. (2003). Image retrieval: Theory and practice. Scarecrow Press.
Joshi, N., Kienzle, W., Toelle, M., Uyttendaele, M., & Cohen, M. F. (2015). Real-time hyperlapse creation via optimal frame selection. ACM Transactions on Graphics,
34(4), 63. https://doi.org/10.1145/2766954.
Kalyuga, S., Chandler, P., & Sweller, J. (1999). Managing split-attention and redundancy in multimedia instruction. Applied Cognitive Psychology, 13(4), 351–371.
https://doi.org/10.1002/(SICI)1099-0720(199908)13:4<351::AID-ACP589>3.0.CO;2-6.
Kent, A., Belzer, J., Kurfeest, M., Dym, E. D., Shirey, D. L., & Bose, A. (1967). Relevance predictability in information retrieval systems. Methods of Information in
Medicine, 6, 45–51. https://doi.org/10.1055/s-0038-1636254.
Kim, J., Guo, P. J., Cai, C. J., Li, S.-W.(D.), Gajos, K. Z., & Miller, R. C. (2014). Data-driven interaction techniques for improving navigation of educational videos.
Proceedings of the 27th annual ACM symposium on user interface software and technology (UIST ’14) (pp. 563–572). . https://doi.org/10.1145/2642918.2647389.
Koenemann, J., & Belkin, N. J. (1996). A case for interaction: A study of interactive information retrieval behavior and effectiveness. Proceedings of the ACM SIGCHI
conference on human factors in computing systems (pp. 205–212). . https://doi.org/10.1145/238386.238487.
Komlodi, A., & Marchionini, G. (1998). Key frame preview techniques for video browsing. Proceedings of the ACM conference on digital libraries (pp. 118–125). . https://
doi.org/10.1145/276675.276688.
Lee, H., & Smeaton, A. F. (2002). Designing the user interface for the Físchlár digital video library. Journal of Digital Information, 2(4). Retrieved January 16, 2018, from https://journals.tdl.org/jodi/index.php/jodi/article/view/54/57.
Li, Z., Shi, S., & Zhang, L. (2008). Improving relevance judgment of web search results with image excerpts. Proceedings of the 17th international conference on world wide
web (pp. 21–30). . https://doi.org/10.1145/1367497.1367501.
Lokoč, J., Bailer, W., Schoeffmann, K., Muenzer, B., & Awad, G. (2018). On influential trends in interactive video retrieval: Video browser showdown 2015–2017. IEEE
Transactions on Multimedia, 20(12), 3361–3376. https://doi.org/10.1109/TMM.2018.2830110.
Lokoč, J., Kovalčik, G., Münzer, B., Schoeffmann, K., Bailer, W., Gasser, R., et al. (2019). Interactive search or sequential browsing? A detailed analysis of the video
browser showdown 2018. ACM Transactions on Multimedia Computing, Communications, and Applications, 15(1), 29. https://doi.org/10.1145/3295663.
Loumakis, F., Stumpf, S., & Grayson, D. (2011). This image smells good: Effects of image information scent in search engine results pages. Proceedings of the 20th ACM
international conference on information and knowledge management (CIKM) (pp. 475–484). . https://doi.org/10.1145/2063576.2063649.
Low, T., Hentschel, C., Stober, S., Sack, H., & Nürnberger, A. (2017). Exploring large movie collections: Comparing visual berrypicking and traditional browsing.
Lecture notes in computer science: . vol. 10133. International conference on multimedia modeling: Multimedia modeling, proceedings, part II (pp. 198–208). . https://doi.
org/10.1007/978-3-319-51814-5_17.
Marchionini, G., Song, Y., & Farrell, R. (2009). Multimedia surrogates for video retrieval: Toward combining spoken words and imagery. Information Processing &
Management, 45(6), 615–630. https://doi.org/10.1016/j.ipm.2009.05.007.
Marchionini, G., & White, R. (2007). Find what you need, understand what you find. International Journal of Human-Computer Interaction, 23(3), 205–237. https://doi.
org/10.1080/10447310701702352.
Marchionini, G., Wildemuth, B. M., & Geisler, G. (2006). The open video digital library: A Möbius strip of research and practice. Journal of the American Society for
Information Science & Technology, 57(12), 1629–1643. https://doi.org/10.1002/asi.20336.
Matejka, J., Grossman, T., & Fitzmaurice, G. (2013). Swifter: Improved online video scrubbing. Proceedings of the SIGCHI conference on human factors in computing
systems (CHI 2013) (pp. 1159–1168). . https://doi.org/10.1145/2470654.2466149.
Mayer, R., & Moreno, R. (1998). A split-attention effect in multimedia learning: Evidence for dual processing systems in working memory. Journal of Educational
Psychology, 90(2), 312–320. https://doi.org/10.1037/0022-0663.90.2.312.
Mei, T., Yang, B., Yang, S. Q., & Hua, X. S. (2008). Video collage: Presenting a video sequence using a single image. The Visual Computer, 25(2008), 39–51. https://doi.
org/10.1007/s00371-008-0282-4.
Moraveji, N. (2004). Improving video browsing with an eye-tracking evaluation of feature-based color bars. Proceedings of the 2004 joint ACM/IEEE conference on digital
libraries (JCDL ’04) (pp. 49–50). . https://doi.org/10.1109/JCDL.2004.240002.
Münzer, B., Schoeffmann, K., et al. (2018). Video browsing on a circular timeline. In K. Schoeffmann, (Ed.). MultiMedia modeling: 24th international conference (MMM
2018), proceedings, Part II (pp. 395–399). Springer. https://doi.org/10.1007/978-3-319-73600-6_40.
Naphade, M. R., Smith, J. R., et al. (2003). A hybrid framework for detecting the semantics of concepts and context. In E. M. Bakker, (Vol. Ed.), Lecture notes in computer
science: . vol. 2728. Image and video retrieval, CIVR 2003 (pp. 196–205). Springer. https://doi.org/10.1007/3-540-45113-7_20.
Natsev, A., Tešić, J., Xie, L., Yan, R., & Smith, J. R. (2007). IBM multimedia search and retrieval system. Proceedings of the 6th ACM international conference on image and
video retrieval (pp. 645). . https://doi.org/10.1145/1282280.1282373.
Nguyen, C., Niu, Y., & Liu, F. (2012). Video Summagator: An interface for video summarization and navigation. Proceedings of the ACM SIGCHI conference on human
factors in computing systems (pp. 647–650). . https://doi.org/10.1145/2207676.2207767.
Nielsen. (2018). Weekly time spent by U.S. adults watching video content via computer as of 1st quarter 2018, by age group (in minutes). Statista - The Statistics Portal.
Retrieved May 14, 2019, from https://www.statista.com/statistics/323867/us-weekly-minutes-online-video-age/.
O'Connor, B. (1985). Access to moving image documents: Background concepts and proposals for surrogates for film and video works. Journal of Documentation, 41(4),
209–220. https://doi.org/10.1108/eb026781.
Paivio, A. (1986). Mental representations: A dual coding approach. Oxford: Oxford University Press.
Panofsky, E. (1955). Meaning in the visual arts: Papers in and on art history. Garden City, NY: Doubleday Anchor Books.
Peker, K. A., & Divakaran, A. (2003). An extended framework for adaptive playback-based video summarization. SPIE Internet Multimedia Management Systems IV, 5242,
26–33. https://doi.org/10.1117/12.514742.
Petrelli, D., & Auld, D. (2008). An examination of automatic video retrieval technology on access to the contents of an historical video archive. Program: Electronic
Library and Information Systems, 42(2), 115–136. https://doi.org/10.1108/00330330810867684.
Pirolli, P., & Card, S. (1999). Information foraging theory. Psychological Review, 106(4), 643–675. https://doi.org/10.1037/0033-295X.106.4.643.
Rafferty, P., & Hidderley, R. (2005). Indexing multimedia and creative works: The problems of meaning and interpretation. Hants, England: Ashgate Publishing.
Rath, G. J., Resnick, A., & Savage, T. R. (1961). Comparisons of four types of lexical indicators of content. American Documentation, 12(2), 126–130. https://doi.org/10.
1002/asi.5090120208.
Robertson, G., Czerwinski, M., Larson, K., Robbins, D. C., Thiel, D., & van Dantzich, M. (1998). Data mountain: Using spatial memory for document management.
Proceedings of the 11th annual ACM symposium on user interface software and technology (pp. 153–162). . https://doi.org/10.1145/288392.288596.
Rorvig, M. E. (1993). A method for automatically abstracting visual documents. Journal of the American Society for Information Science, 44(1), 40–56. https://doi.org/
10.1002/(SICI)1097-4571(199301)44:1<40::AID-ASI5>3.0.CO;2-J.
Rossetto, L., Schuldt, H., et al. (2018). The long tail of web video. In K. Schoeffmann, (Ed.). MultiMedia modeling: 24th international conference (MMM 2018), proceedings, part II (pp. 302–314). Springer. https://doi.org/10.1007/978-3-319-73600-6_26.
Schamber, L., Eisenberg, M. B., & Nilan, M. S. (1990). A re-examination of relevance: Toward a dynamic, situational definition. Information Processing & Management,
26(6), 755–776. https://doi.org/10.1016/0306-4573(90)90050-C.
Schoeffmann, K. (2014). A user-centric media retrieval competition: The video browser showdown 2012–2014. IEEE MultiMedia, 21(4), 8–13. https://doi.org/10.
1109/MMUL.2014.56.
Schoeffmann, K., Hopfgartner, F., Marques, O., Boeszoermenyi, L., & Jose, J. M. (2010). Video browsing interfaces and applications: A review. SPIE Reviews, 1, 18004.
https://doi.org/10.1117/6.0000005.
Schoeffmann, K., Hudelist, M. A., & Huber, J. (2015). Video interaction tools: A survey of recent work. ACM Computing Surveys, 48(1), 14. https://doi.org/10.1145/
2808796.
Schoeffmann, K., Münzer, B., Primus, M. J., Kletz, S., & Leibetseder, A. (2018). How experts search different than novices: An evaluation of the diveXplore video
retrieval system at video browser showdown 2018. Proceedings of the IEEE international conference on multimedia & expo workshops (ICMEW)https://doi.org/10.
1109/ICMEW.2018.8551552.
Schoeffmann, K., Taschwer, M., & Boeszoermenyi, L. (2010). The video explorer: A tool for navigation and searching within a single video based on fast content
analysis. Proceedings of the 1st annual ACM SIGMM conference on multimedia systems (pp. 247–258). . https://doi.org/10.1145/1730836.1730867.
Smeaton, A. (2001). Content-based access to digital video: The Físchlár system and the TREC video track. MMCBIR 2001 – Multimedia content-based indexing and
retrieval (INRIA, Rocquencourt, France, September 2001). Retrieved October 17, 2017, from http://doras.dcu.ie/330/1/mmcbir_2001.pdf.
Smeaton, A. F., Gurrin, C., & Lee, H. (2006). Interactive searching and browsing of video archives: Using text and using image matching. In R. I. Hammoud (Ed.).
Interactive video: Algorithms and technologies (pp. 189–206). Springer. https://doi.org/10.1007/978-3-540-33215-2_9.pdf.
Smeaton, A. F., Murphy, N., O'Connor, N. E., Marlow, S., Lee, H., McDonald, K., et al. (2001). The Físchlár digital video system: A digital library of broadcast TV
programmes. Proceedings of the 1st ACM/IEEE-CS joint conference on digital libraries (JCDL) (pp. 312–313). . https://doi.org/10.1145/379437.379696.
Smeaton, A. F., Over, P., & Kraaij, W. (2006). Evaluation campaigns and TRECVid. Proceedings of the 8th ACM international workshop on multimedia information retrieval
(pp. 321–330). . https://doi.org/10.1145/1178677.1178722.
Smith, M., & Kanade, T. (1998). Video skimming and characterization through the combination of image and language understanding. Proceedings of the 1998 IEEE
international workshop on content-based access of image and video databases (pp. 61–70). . https://doi.org/10.1109/CAIVD.1998.646034.
Snoek, C. G. M., van de Sande, K. E. A., de Rooij, O., Huurnink, B., Uijlings, J. R. R., van Liempt, J., et al. (2009). The MediaMill TRECVID 2009 semantic video search engine. Retrieved January 26, 2018, from http://epubs.surrey.ac.uk/733282/2/mediamill-TRECVID2009-final.pdf.
Snoek, C. G. M., Worring, M., de Rooij, O., van de Sande, K. E. A., Yan, R., & Hauptmann, A. G. (2008). VideOlympics: Real-time evaluation of multimedia retrieval
systems. IEEE Multimedia, 15(1), 86–91. https://doi.org/10.1109/MMUL.2008.21.
Song, Y., & Marchionini, G. (2007). Effects of audio and visual surrogates for making sense of digital video. Proceedings of the ACM SIGCHI conference on human factors
in computing systems (pp. 867–876). . https://doi.org/10.1145/1240624.1240755.
Spencer, S. (2010, March 18). Anatomy of a Google snippet. Search Engine Land. Retrieved December 11, 2017, from https://searchengineland.com/anatomy-of-a-
google-snippet-38357.
Sperber, D., & Wilson, D. (1986). Relevance: Communication and cognition. Cambridge, MA: Harvard University Press.
Srinivasan, S., Ponceleon, D., Amir, A., & Petkovic, D. (1999). “What is in that video anyway?”: In search of better browsing. Proceedings of the IEEE international
conference on multimedia computing and systems (pp. 388–393). . https://doi.org/10.1109/MMCS.1999.779235.
Taskiran, C., Chen, J.-Y., Albiol, A., Torres, L., Bouman, C. A., & Delp, E. J. (2004). ViBE: A compressed video database structured for active browsing and search. IEEE
Transactions on Multimedia, 6(1), 103–118. https://doi.org/10.1109/TMM.2003.819783.
TechCrunch. (2015). Number of daily video views on Facebook as of November 2015 (in billions). Statista - The Statistics Portal. Retrieved October 17, 2017, from https://
www.statista.com/statistics/513462/facebook-daily-video-views/.
Teevan, J., Cutrell, E., Fisher, D., Drucker, S. M., Ramos, G., André, P., et al. (2009). Visual snippets: Summarizing web pages for search and revisitation. Proceedings of
the ACM SIGCHI conference on human factors in computing systems (pp. 2023–2032). . https://doi.org/10.1145/1518701.1519008.
Thompson, C. W. (1973). Functions of abstracts in initial screening of technical documents by user. Journal of the American Society for Information Science, 24(4),
270–276. https://doi.org/10.1002/asi.4630240407.
Tindall-Ford, S., Chandler, P., & Sweller, J. (1997). When two sensory modes are better than one. Journal of Experimental Psychology: Applied, 3(4), 257–287. https://doi.org/10.1037/1076-898X.3.4.257.
Tse, T., Marchionini, G., Ding, W., Slaughter, L., & Komlodi, A. (1998). Dynamic keyframe presentation techniques for augmenting video browsing. Proceedings of the
working conference on advanced visual interfaces (AVI ’98) (pp. 185–194). . https://doi.org/10.1145/948496.948522.
Wang, P., & Soergel, D. (1998). A cognitive model of document use during a research project. Study I. Document selection. Journal of the American Society for
Information Science, 49(2), 115–133. https://doi.org/10.1002/(SICI)1097-4571(199802)49:2%3C115::AID-ASI3%3E3.0.CO;2-T.
Westman, S. (2010). Evaluation of visual video summaries: User-supplied constructs and descriptions. International Journal on Digital Libraries, 11(2), 125–140. https://doi.org/10.1007/s00799-011-0071-y.
Wildemuth, B. M., Marchionini, G., Wilkens, T., Yang, M., Geisler, G., Fowler, B., et al. (2002). Alternative surrogates for video objects in a digital library: Users’
perspectives on their relative usability. In M. Agosti, & C. Thanos (Vol. Eds.), Lecture Notes in Computer Science: . vol. 2458. Proceedings of research and advanced
technology for digital libraries: 6th European conference (ECDL 2002) (pp. 493–507). Springer. https://doi.org/10.1007/3-540-45747-X_36.
Wildemuth, B. M., Marchionini, G., Yang, M., Geisler, G., Wilkens, T., Hughes, A., et al. (2003). How fast is too fast? Evaluating fast forward surrogates for digital
video. Proceedings of the 3rd ACM/IEEE-CS joint conference on digital libraries (JCDL 2003) (pp. 221–230). . https://doi.org/10.1109/JCDL.2003.1204866.
Woodruff, A., Rosenholtz, R., Morrison, J. B., Faulring, A., & Pirolli, P. (2002). A comparison of the use of text summaries, plain thumbnails, and enhanced thumbnails
for web search tasks. Journal of the American Society for Information Science & Technology, 53(2), 172–185. https://doi.org/10.1002/asi.10029.
Yang, M., & Marchionini, G. (2005). Deciphering visual gist and its implications for video retrieval and interface design. ACM SIGCHI conference on human factors in
computing systems: CHI '05 extended abstracts (pp. 1877–1880). . https://doi.org/10.1145/1056808.1057045.
Yang, M., Wildemuth, B. M., Marchionini, G., Wilkens, T., Geisler, G., Hughes, A., et al. (2003a). Measures of user performance in video retrieval research. Chapel Hill, NC: University of North Carolina, School of Information & Library Science. SILS Technical Report TR-2003-02. Retrieved January 11, 2018, from https://sils.unc.edu/sites/default/files/general/research/TR-2003-02.pdf.
Yang, M., Wildemuth, B. M., Marchionini, G., Wilkens, T., Geisler, G., Hughes, A., et al. (2003b). Measuring user performance during interactions with digital video
collections. Proceedings of the 66th annual meeting of the American Society for Information Science and Technology (ASIST 2003) (pp. 3–11). . https://doi.org/10.
1002/meet.1450400101.
Zhang, Y., & Li, Y. (2008). A user-centered functional metadata evaluation of moving image collections. Journal of the American Society for Information Science &
Technology, 59(8), 1331–1346. https://doi.org/10.1002/asi.20839.
Zipf, G. K. (1949). Human behavior and the principle of least effort: An introduction to human ecology. Cambridge, MA: Addison Wesley.