
Bridging the Gap Between Tagging and Querying Vocabularies:

Analyses and Applications for Enhancing Multimedia IR


Kerstin Bischoff∗, Claudiu S. Firan, Wolfgang Nejdl, Raluca Paiu
L3S Research Center, Appelstr. 4, 30167 Hannover, Germany
∗ Corresponding author. Tel: +49 511 762 17730. Email addresses: bischoff@L3S.de (Kerstin Bischoff), firan@L3S.de (Claudiu S. Firan), nejdl@L3S.de (Wolfgang Nejdl), paiu@L3S.de (Raluca Paiu).

Abstract

Collaborative tagging has become an increasingly popular means for sharing and organizing Web resources, leading to a huge amount of user-generated metadata. These annotations capture quite a few different aspects of the resources they are attached to, but it is not obvious which characteristics of the objects are predominantly described. The usefulness of these tags for finding and re-finding the annotated resources is also not completely clear. Several studies have started to investigate these issues, though only by focusing on a single type of tagging system or resource. We study this problem across multiple domains and resource types and identify the gaps between the tag space and the querying vocabulary. Based on the findings of this analysis, we then try to bridge the identified gaps, focusing in particular on multimedia resources. We concentrate on the two scenarios of music and picture resources and develop algorithms which identify usage (theme) and opinion (mood) characteristics of the items. The mood and theme labels our algorithms infer are recommended to the users in order to support them during the annotation process. The evaluation of the proposed methods against user judgements, as well as against expert ground truth, reveals the high quality of our recommended annotations and provides insights into possible extensions of music and picture tagging systems to support retrieval.

Key words: Web 2.0, tag analysis, web information retrieval, knowledge discovery, tag recommendation

1. Introduction

Web 2.0 tools and environments have made collaborative tagging very popular, resulting in a huge amount of user-generated metadata. Still, users' motivations for tagging, as well as the types of assigned tags, differ across systems, and despite initial investigations their potential to improve search remains unclear. What kinds of tags are used, and which types can improve search most? We investigate this issue in detail by analyzing tag data from three very different tagging systems: Del.icio.us, Flickr and Last.fm.

Our comparative study of the users' tagging and querying habits reveals some interesting aspects. While in general tag and query distributions have similar characteristics, some significant differences are to be noted: usage (theme) is very prevalent in user queries for music, as is opinion (mood) for music and picture queries, but many more tags of these types would be needed.

The findings of this analysis are essential, especially for tagging systems focusing mostly on multimedia resources. While for Web pages or publications tags may not improve retrieval that much, for pictures, music or movies the gain is substantial.



Content-based retrieval is still not mature enough to enable scalable content-based search. Moreover, even with the most prominent search engines on the Web today, users are still constrained to search for music or pictures using textual queries. In this context, supporting users in providing meaningful tags for this type of resource becomes crucial.

One possibility to make users use keywords from the categories we need is to unobtrusively recommend such tags and thus support them in the tagging process. Besides minimizing the cognitive load by changing the task from generation to recognition [36], such recommendation of under-represented but valuable tags will very likely trigger reinforcement, i.e. enforce preferential attachment. As presented in [33,22], seeing previous tag assignments from other users strongly influences which tags will be assigned next and thus to which tag set a resource's vocabulary will converge.

We build upon results of previous studies [6–8], where we proposed algorithms relying on tags for identifying other types of valuable knowledge about music resources. In the present paper we generalize these algorithms and verify their applicability on another type of multimedia resource – pictures. We summarize results of previously introduced algorithms which try to identify music genres, styles, moods and themes based on existing user annotations. Given that pictures are quite different from music resources, and their associated metadata also differs considerably, it was not clear how, and whether at all, this approach could also be employed for pictures. Moreover, we prove the applicability of the method on a different dataset with respect to system design choices beyond object type. Instead of music experts' labels, Flickr social groups, i.e. users pooling pictures around topics, serve as labeled training data. With this refinement there is no need for reducing the number of classes by means of clustering, effectively ruling out one potential source of noise and thus achieving highly accurate predictions.

The contributions of this paper are threefold: (1) We propose a novel approach for accurately detecting emotions in photos relying on collaborative tagging, whereas state-of-the-art solutions are content based; (2) With the presented algorithms we manage to bridge some significant gaps between the tagging and querying vocabularies, which in turn will enable more efficient multimedia information retrieval; (3) We present extensive experiments demonstrating the performance of the proposed algorithms and compare their results against baseline algorithms.

The methods we propose can be used in various ways: as part of an application where the recommendations are presented directly to the user, who can select the relevant ones and add them to the item that is currently annotated; another possibility is to index the recommended mood and theme tags, thus enriching the metadata indexes. Last but not least, the recommendations can be used to automatically create mood- or theme-based playlists in the case of music resources, or mood-based picture catalogs.

The rest of the paper is organized as follows: we start in Section 2 with a review of the relevant literature, structured around the areas we also address in this work. Next, in Section 3 we present a detailed analysis of different tagging systems across different domains and compare tagging and querying vocabularies, thus trying to identify potential gaps. Building upon the results of this analysis, in Section 4 we focus on bridging these semantic gaps and propose solutions suitable for two types of multimedia: music (Section 4.2) and picture (Section 4.3) resources. Finally, in Section 5 we conclude and discuss possible extensions of this work.

2. Related Work

Given that Web 2.0 tools and platforms have made collaborative tagging highly popular, some studies have started to investigate tagging motivations and patterns, but they usually focus on one specific collection only [21,22,33,2,23] or provide first qualitative insights across collections from very small samples [29,39]. We will briefly review some of the major work related to the areas we also address in this paper: tagging behavior, as well as search and knowledge discovery based on social tags.

Tag Types. First analyses of tagging systems show that the reasons for tagging are diverse, and with them the kinds of tags used. Marlow et al. [29] identify organizational motivations for tagging, as well as social tagging motivations, such as opinion expression, the attraction of attention, and self-presentation [21,29] or providing context to friends [2]. The different tag types shed light on what distinctions are important to taggers [21]. According to [39], in free-for-all systems opinion expression, self-presentation, activism and performance tags become more frequent, while in self-tagging systems, like Flickr or Del.icio.us, users tag almost exclusively for their own benefit of enhanced information organization [21].
Sen et al. [33] showed in an experiment on vocabulary formation in the MovieLens system how different design choices affect the nature/types of tags used, their distributions and the convergence within a group, i.e. the proportions that "Factual", "Subjective" and "Personal" tags will have.

Tags Supporting Search. Based on the idea that tags in bookmarking systems usually provide good summaries of the resources and that they indicate the popularity of a page, Bao et al. [4] investigated the use of tags for improving Web search. Their proposed SocialSimilarityRank measures the association between tags, and SocialPageRank accounts for the popularity among taggers in terms of a frequency ranking. In [24], the authors suggest an adapted PageRank-like algorithm, FolkRank, to improve efficient searching via personalized and topic-specific ranking within the tag space. This can be used to recommend interesting users, resources and related tags to increase the chance of "serendipitous encounters". In music retrieval, tags can be used as an alternative or additional possibility to find songs: in [20], Last.fm songs are not only recommended based on track lists (song and artist) of similar users, but also by considering (descriptive) tags. Here, tag-based search algorithms provide better and faster recommendation results than traditional track-based collaborative filtering methods.

In [23], the authors try to answer the question whether social bookmarking data can be used to augment Web search. Their analysis of a Del.icio.us dataset shows that tags tend to gravitate toward certain domains and that tags occur in over 50% of the resources they annotate, thus potentially improving search. Even if the usefulness of tags has been proven at the single-site level, a general study of the types of tags used across multiple systems and their implications for search is still missing. In this paper we tackle this aspect and perform an in-depth analysis over three different systems.

Knowledge Discovery for Music. While automatic identification of music themes has not been studied so far, several experiments in Music Information Retrieval have shown a potential to model mood from audio content. For example, [28] relies on extracted low-level features like timbre, intensity and rhythm (modeled in a Gaussian Mixture Model) to classify music according to Thayer's model of emotions [37]. Similarly, in [19] the authors propose a schema in which music databases are indexed on four labels of music mood: "happiness", "sadness", "anger" and "fear". An important limitation of these approaches is that they cannot capture other, 'external' sources of emotionality, for example events that people may associate with a certain piece of music. Eck et al. [16] investigate social tags for improving music recommendations, attenuating the cold-start problem by automatically predicting additional tags based on the learned relationship between existing tags and acoustic features. In [11], Last.fm user tags have been used together with content-based features for automatic genre classification. Underlining the usefulness of social tags for music classification, Levy and Sandler [27] found that Last.fm tags define a low-dimensional semantic space which is able to effectively capture sensible attributes as well as music similarity. Especially at the track level this space is highly organized by artist and genre. In their experiments, track term vectors built upon social tags led to high average precision values for retrieval by genre and artist. Thus, tag vector representations of music tracks correspond to traditional, well-known music catalog organization principles while at the same time providing rich descriptions for each track based on a very large vocabulary.

Knowledge Discovery for Pictures. Picture metadata enrichment is similar to music metadata enrichment in that the goal can be achieved either by using information inferred from the low-level features of the resources or by exploiting already provided user annotations. In [3], user tags are combined with content-based techniques in order to improve data navigation and search: a classifier uses low-level features, like color and texture, in addition to tags provided by the users in order to discover new relationships between data. ZoneTag¹ [30] automatically recommends location tags for photos taken with a mobile phone, based on the phone's position. In [35], the authors focus on a subset of Flickr pictures and analyze the different tag categories used by users to annotate their pictures. The analysis is performed automatically based on WordNet categories. The paper also tackles the aspect of tag recommendation.

Rattenbury et al. [32] try to extract event and place semantics from tags assigned to photos in Flickr, relying on burst analysis. In [1], landmark pictures for city sights are identified, accompanied by representative tags, by employing machine learning on the user-generated tags in Flickr. Investigating tag evolution in Flickr, Dubinko et al. [14] developed algorithms to find the most interesting tags to be displayed in Flash animations.

¹ http://zonetag.research.yahoo.com
Predicting moods/emotions for pictures is much less popular than for music. Prior work uses content-based methods to analyze and classify facial expressions (see [18] for an overview), sometimes also picture mood independent of people's faces [15]. In contrast to prior work, our algorithms can distinguish a much richer set of emotions/moods than the often very simple models underlying content-based approaches.

3. Tag Analysis

The following section presents and discusses the results of our comparative investigations of tag usage in Last.fm, Del.icio.us, and Flickr. Looking at the usage of different types of tags, we first identify and quantify the distinctions occurring in users' tagging behavior. Most of the tags are potentially useful for search, though not all kinds of tags are equally valuable. We then investigate how well users' tagging and searching behaviors correspond.

3.1. Datasets

Last.fm. For our analysis, we crawled an extensive subset of the Last.fm website in May 2007, focusing on pages corresponding to tags, music tracks and user profiles. We obtained information about a total of 317,058 tracks and their associated attributes, including track and artist name, as well as tags for these tracks plus their corresponding usage frequencies. Starting from the most popular tags, we found 21,177 different tags, which are used on Last.fm for tagging tracks, artists or albums. For each of these tags we extracted the number of times the tag has been used, the number of users who used it, as well as lists of similar tags.

Flickr. For comparison with Flickr characteristics, we took advantage of data crawled by our research partners between January 2004 and December 2005. The crawling was done by starting with some initial tags from the most popular ones and then expanding the crawl based on these tags. We used a small portion of the first 100,000 pictures crawled, associated with 32,378 unique tags assigned with different frequencies.

Del.icio.us. The Del.icio.us data for our analysis was also kindly provided by research partners. This data was collected during July 2005 by gathering a first set of nearly 6,900 users and 700 tags from the start page of Del.icio.us. These were used to download more data in a recursive manner. Additional users and resources were collected by monitoring the Del.icio.us start page. A list of several thousand user names was collected and used for accessing the first 10,000 resources each user had tagged. From the collected data we extracted resources, tags, dates, descriptions, user names, etc. The resulting collection comprises 323,294 unique tags associated with 2,507,688 bookmarks.

Usage of tags basically follows a power law distribution in each system (for details please refer to [6]). The most evenly distributed system is Flickr, where people almost always tag only their own pictures and are not much influenced by others. For Del.icio.us, the influence of others is more visible: popular tags are used more often, while tags in the tail have less weight. Last.fm has even fewer very popular tags, with 60% of the top 100 representing genre information. Since Last.fm covers a very specific domain, tags are more restricted than in Flickr, where images can include everything, and Del.icio.us, which has an even broader range of topics.

In order to improve tag-based search, we first need to know how tags are used and which types of annotations we can expect to find along with resources. Given the vast number of tags in our datasets, for practical reasons we had to sample our data. We manually investigated 900 tags in total. For the three different tagging systems, we took three samples of 100 tags each to be manually classified based on the tag type taxonomy presented in Section 3.2. These three samples per system included the top 100 tags, 100 tags starting from 70% of probability density (based on absolute occurrences), and 100 tags beginning from 90%. These different samples based on rank percentages were chosen based on the results of prior work [22], which suggested that different parts of the power law curve exhibit distinct patterns.
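As an illustration of how such popularity-based samples can be drawn, the sketch below (Python, illustrative only) selects 100 tags starting at a given cumulative probability density; the dictionary tag_counts mapping tags to their absolute occurrence counts is an assumed input, not the actual format of our datasets.

def sample_by_density(tag_counts, start_density, n=100):
    """Return n tags starting at the given cumulative probability density."""
    ranked = sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(count for _, count in ranked)
    cumulative = 0.0
    for i, (_, count) in enumerate(ranked):
        if cumulative / total >= start_density:
            return [tag for tag, _ in ranked[i:i + n]]
        cumulative += count
    return []

# top 100 tags, 100 tags from 70% density, and 100 tags from 90% density:
# top_sample  = [t for t, _ in sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True)[:100]]
# mid_sample  = sample_by_density(tag_counts, 0.70)
# tail_sample = sample_by_density(tag_counts, 0.90)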
Like in other complex systems, in collaborative tagging systems patterns evolve that follow a scale-free power law distribution, indicating convergence of the used vocabulary coexisting with a long tail of highly idiosyncratic terms [22,24]. Commonly used, more general tags have higher proportions [21]. Possible explanations are the imitation of other users' behavior, shared knowledge [21] and preferential attachment [22], as well as effects of system design choices [33,29].
Halpin et al. [22] relate this to the principle of least effort: while speakers prefer ambiguous and general words with minimum entropy, thus minimizing the effort of choosing a word, hearers prefer words with high entropy, i.e. high information value. Thus, in a free-for-all tagging system there is a conflict between agreeing to a convention when tagging and accepting the need for complex, multiple queries. The folksonomy structure evolves due to the consensus arising when tagging, even though tagging is mostly done for personal benefit. Our goal here is to provide descriptive statistics about tag type usage depending on popularity.

3.2. Tag Characteristics

Defining Tag Types. For the purpose of analyzing the kinds of tags used in collaborative tagging, we propose and use an extended tag taxonomy suitable for different tagging systems. We started with an exploratory analysis of existing taxonomies (see [21,33,38]), as well as of possible attribute fields for the different resources to be considered. We kept and refined the most fine-grained scheme presented by Golder and Huberman [21], adding the classes Time and Location in order to make it applicable to systems other than Del.icio.us, which focuses only on Web page annotations. We went through several iterations to improve the scheme by classifying sample tags and testing for agreement between multiple raters. Our final taxonomy comprises eight classes, presented together with example tags from our datasets in Table 1.

Topic is probably the most obvious way to describe an arbitrary resource, referring to what a tagged item is about. For music, Topic was defined to include the main subject (e.g. "romance"), title and lyrics. The Topic of a picture refers to any object or person displayed. While such Topic information can partially be extracted from the content of textual resources [23], it is not easily accessible for pictures or music. Tags in the Time category add contextual information about month, year, season, or other time-related modifiers. This includes the time a picture was taken or a music piece or Web page was produced. Similarly, Location is an additional retrieval cue, providing information about sights, country or town, or the origin of a musician. Tags may also specify the Type, which mainly corresponds to file, media or Web page type ("pdf", "blog", etc.). In music this category comprises tags specifying encoding as well as instrumentation and music genre. For pictures, this includes camera settings and photographic styles like "portrait" or "macro". Yet another way to organize resources is by identifying the Author/Owner who created the resource (author, artist) or owns it (a music and entertainment group like Sony BMG or a Flickr user). Tags can also comment subjectively on the quality of a resource (Opinions/Qualities), expressing opinions based on social motivations typical for free-for-all tagging systems, or simply serving as rating-like annotations to ease personal retrieval. Usage context tags suggest what to use a resource for, or the context/task the resource was collected in and grouped by. These tags (e.g. "jobsearch", "forProgramming", etc.), although subjective, may still be a good basis for recommendations to other users. Last, Self reference contains highly personal tags, mostly helpful for the tagger herself.

Clearly, such classification schemes represent only one possible way of categorizing things. Quite a few tags are ambiguous due to homonymy (especially for Flickr and Del.icio.us, e.g. "apple"). Here, we based our decision on the most popular resource(s) tagged. During classification we even found some tags considered 'factual' difficult to classify directly. For example, "vacation" can be considered the Topic of a Web resource, as well as a personal tag of type Usage context grouping resources for the next holidays. Similarly, "zoo" or "festival" may be depicted in a picture or used as context attributes not directly inferable from the resource. Depending on the intended usage, as well as on probably subjective and cultural differences, such tags fit into more than one category. This problem of defining concise category boundaries also applies to the other categorization schemes presented in related work.

For evaluating our scheme using inter-rater agreement, we selected 75 tags per system from our initial sample (25 randomly chosen tags per popularity range) and had them assessed by students unfamiliar with the tag categorization scheme. We computed Cohen's unweighted Kappa (κ) [13], the standard measure for assessing concordance on nominal data, which indicates the inter-rater agreement achieved beyond chance. Our raw agreement value for the κ calculation is about 0.79; after accounting for the agreement expected by chance, this results in a κ of 0.71, which is considered substantial inter-rater reliability [26].
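For reference, unweighted Cohen's κ can be computed from the two raters' category assignments as in the following sketch (Python, illustrative; the paired label lists are assumed inputs):

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters' nominal labels (paired lists)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)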
Part of the disagreement observed may be caused by ambiguity of the classification scheme.
Nr. Category Last.fm Flickr Del.icio.us
1 Topic romance, revolution people, flowers webdesign, linux
2 Time 80s 2005, july daily, current
3 Location england, african toronto, kingscross slovakia, newcastle
4 Type pop, acoustic portrait, 50mm movies, mp3, blogs
5 Author/Owner the beatles, wax trax wright wired, alanmoore
6 Opinions/Qualities great lyrics, rowdy scary, bright annoying, funny
7 Usage context workout, study, lost vacation, science review.later, travelling
8 Self reference albums i own, seen live me, 100views frommyrssfeeds
Table 1
Tag classification taxonomy, applicable to different tagging systems

The confusion matrix created for the κ calculation reveals several prominent confusion patterns for the Del.icio.us tags, always involving the default Topic category. Specifically, in several cases we found disagreement on whether a tag indicated the Topic or the Type, Author/Owner or Usage context. These may indicate fuzzy category boundaries, subjectivity and cultural dependency, showing the direction of further improvements. To account for the ambiguity in tag meaning and tag function for certain resources, we gave the raters a chance to name a second category that would fit as well. Taking this second possible category for a tag into account, our κ improved considerably to 0.80.

Distribution of Tag Types Across Systems. Having defined this general tag taxonomy, we are interested in seeing the tag distributions over the eight tag classes. We classified all sample tags taken from the three different systems according to the established taxonomy. The resulting distributions of tag types across systems are shown in Figure 1.² A general conclusion is that tag types are very different for different collections. Specifically, the most important category for Del.icio.us and Flickr is Topic, while for Last.fm the Type category is the most prominent one, due to the abundance of genre tags, which fall into this class. Obviously, genre is the easiest way of characterizing and organizing music. One of the rare exceptions was the theme "romance" and some parts of the lyrics or titles. In contrast, a similar dominance can be observed for Topic for Web resources and pictures. Type is also common in Del.icio.us, as it specifies whether a page contains certain media. As Flickr is used only for pictures, Type variations only include fine-grained distinctions like "macro", but most users do not seem to make such professional annotations. For pictures only, Location plays an important role. Usage context seems to be used more in Del.icio.us and Flickr, while Last.fm as a free-for-all tagging system (with lower motivation for organization) exhibits a significantly higher amount of subjective (Opinions/Qualities) tags. Time and Self reference represent only a very small part of the tags studied here. Author/Owner is a little more frequent, though very rarely used in Flickr due to the fact that people mainly tag their own pictures [29].

Studying the distributions for all systems across all samples, we find a clear tendency towards preferred tag functions that do not depend much on the popularity of the tags. For example, we observe that Type remains the predominant tag category for music, while for URLs and pictures it is Topic. For the long tail of the Last.fm sample, usage of the Type category somewhat decreases, and opinion expression and artist labeling (Author/Owner) become more important.

With respect to exploiting tags in web search, it is encouraging to see that most tags are factual in nature, verifiable and thus potentially relevant to the community and other users. This applies to Topics and resource Type in general, to Topic and Location for pictures, and to a certain degree to Type for music. Subjective and personal tags (categories 6 and 8) are only a minor part. Similar to results reported in [39], Opinions/Qualities are only characteristic for social, free-for-all music tagging systems (like Last.fm), possibly because for young people (exposing) music taste is one important aspect in forming one's own personal identity.

Other interesting results of this analysis refer to the added value of tags compared to existing content. From the Del.icio.us crawl we had extracted 20,911 URLs for which we had the full HTML page.³

² In later work, we classified 700 sample tags per tagging system, resulting in similar distributions [5].
³ The HTML pages were taken from a WebBase crawl (http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/).
Fig. 1. Tag type distributions across systems

For these we counted how many tags appear in the Web page text they annotate and found that this is the case for only 44.85% of the selected Del.icio.us tags. In other words, more than 50% of existing tags bring new information to the resources they annotate. In the music domain this is the case for 98.5% of the tags, as Last.fm tags are usually not contained at all in lyrics (the only original textual content available). For a subsample of 77,498 tracks, we took all tags corresponding to the tracks and tried to find them in the track lyrics. On average, only 1.54% of the tracks' tags occurred in the lyrics. Especially for multimedia data, such as music, pictures or videos, the gain provided by the newly available textual information is thus substantial.

We also showed that a large amount of tags is accurate and reliable. In the music domain, for example, 73.01% of the tags also occur in online music reviews retrieved via Google, and 46.14% are even found in expert reviews on AllMusic.com. To analyze the overlap between tags assigned to Last.fm tracks and music reviews extracted from Google results, we randomly selected 8,130 tracks from our original dataset, for which we tried to find music reviews by sending queries of the form ["artist" "track" music review -lyrics] to Google. The same query was used in [25]. For each of the selected tracks we considered the top 100 Google results and extracted the text of the corresponding pages to create one single document, inside which we searched for the tags corresponding to the track. 73.01% of the track tags occurred inside such review pages. This overlap is rather high, probably because most of the Last.fm tags represent genre names, which also occur very often in music reviews.

Second, we investigated how many of the tags assigned to tracks occurred in the manually created expert reviews from AllMusic.com. We randomly selected music tracks from our Last.fm dataset and crawled the Web pages corresponding to their AllMusic.com reviews. If no review was available for a track, we tried to find the review Web page of the album featuring that track. The resulting dataset consisted of 3,600 reviews. Following the same procedure as for the previous experiment with reviews retrieved via Google, we found that 46.14% of the tags belonging to a track occurred on the AllMusic.com review pages. We hypothesize that the lower number of matches is due to the fact that AllMusic.com reviews are created by a relatively small number of human experts, who use a more homogeneous and thus restricted vocabulary than that found in arbitrary reviews on the Web. Still, at least one Last.fm tag occurs in the review texts for almost all analyzed tracks. This proves tags to be a reliable source of metadata about songs, created easily by a much larger number of users.
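The overlap figures above boil down to a simple per-resource computation. A minimal sketch (Python, illustrative), assuming each resource comes with its tag list and the associated text (page text, lyrics, or the concatenated review pages) and using case-insensitive substring matching as one possible matching rule:

def tag_overlap(tags, text):
    """Fraction of a resource's tags that literally appear in the given text."""
    if not tags:
        return 0.0
    text_lower = text.lower()
    return sum(1 for tag in tags if tag.lower() in text_lower) / len(tags)

# Averaging tag_overlap over all resources (tags vs. page text, lyrics, or review
# documents) yields per-corpus overlap figures like the ones reported above.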
3.3. Usefulness of Tags for Search

Extending and complementing our tagging analysis, we also explored how users' searching and tagging behavior compare.
Fig. 2. Distribution of query types for different resources

In this experiment, we investigated how much current Web queries overlap with tags. We used the AOL query logs [31] to calculate the overlap between Web queries and tags, and contrasted tag and query classes. In our comparative analysis of tags and queries we tried to map web queries onto the tag taxonomy established in Section 3.2, thus investigating which kinds of tags could best answer a given query. We built a frequency-sorted list of all queries in the AOL log and took three samples from different regions of the power law curve. We sampled 300 queries per type of resource (images, songs, Web pages) by filtering the query log for queries containing a keyword (like "music", "song", "picture", etc.) or having a click on Last.fm or Flickr. The resulting queries were classified into our eight categories, with queries belonging to multiple classes in case they consisted of terms corresponding to different functions.
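A minimal sketch of this filtering step (Python, illustrative): the keyword lists and the assumed log format (query string plus optionally clicked URL) are simplifications of the actual procedure.

MUSIC_KEYWORDS = {"music", "song", "songs"}                # illustrative keyword lists
IMAGE_KEYWORDS = {"picture", "pictures", "photo", "image"}

def resource_type(query, clicked_url=""):
    """Assign a query to a resource type based on keywords or the clicked domain."""
    terms = set(query.lower().split())
    if "last.fm" in clicked_url or terms & MUSIC_KEYWORDS:
        return "music"
    if "flickr.com" in clicked_url or terms & IMAGE_KEYWORDS:
        return "images"
    return "web"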
The results are shown in Figure 2. Not quite surprisingly, general Web queries often name the Topic of a resource, just like tags in Del.icio.us do to an even larger extent. The query distribution pattern fits the distribution of tags, except for a clear difference regarding category 5 (Author/Owner). Usage context is more often used for tag-based information organization than for search. For obvious reasons, Self reference is not a useful query type for public Web resources.

For images, our tag type distribution corresponds almost perfectly to the query type patterns. As Figure 2 shows, Topic accounts for about half of the queries, as well as of the tags in Flickr. Slight differences exist for Location, which is used more for tagging than for searching, and for Author/Owner, which is somewhat more important for queries than for tagging. Interestingly, there seem to be many more subjective queries asking for Opinions/Qualities like "funny", "public" or "sexy" pictures. With decreasing popularity of queries, this category becomes somewhat less important, but the prevalence of Topic increases slightly. While the number of 'adult content' queries in picture search was high for all three subsamples of varying popularity, this kind of tag was completely underrepresented in our analyzed samples of Flickr, Del.icio.us and Last.fm (one tag in Del.icio.us).⁴

⁴ This holds also for the larger sample analyzed in [5].

The biggest deviation between queries and tags occurs for music queries. While our tags in Last.fm are to a large extent genre names, user queries often belong to the Usage context category (like "wedding songs" or songs from a movie, category 7). Also, users search for known music by artist (category 5), title or theme (category 1). These differences may be due to information value considerations: as artist and title are already provided in Last.fm as formal metadata, there is no need to tag resources with this information. Lyrics are not frequently searched for. A surprising observation is that searching by genre is rare: users intensively use tags from this category, but do not use them to search for music. One reason for this might be that many music pieces get tagged with the same genre, and thus search results for genre queries would contain far too many hits. Categorizing tracks into genres is also subjective to a certain extent, as it depends on the annotator's expertise.
The amount of subjective qualities asked for is comparable to the amount tagged in the Last.fm system.

Comparing categories of tags and queries offers some interesting insights: most of the general Web queries are Topic-related (as are most of the tags for Del.icio.us and Flickr). For Web resources Topic tags are very useful, as over 30% of the queries target this category; but we also see that although users query the Author/Owner category, they usually do not tag in this way. For images, the Topic category is as prominent and important for tags as it is for queries. However, many queries are about Opinions/Qualities, yet users tend to add more Location tags than the needed Opinions/Qualities. So, even if users actually like to search for "funny" or "scary" pictures, they often do not tag them in this way. As for the music domain, tags generally fall into the Type (i.e. genre) class, although more tags from the Usage context and Topic categories would be needed (Author/Owner is already present). This leads to the necessity of providing ways to extend and direct the tagging vocabulary towards the sought classes.

4. Discovering Knowledge Through Tags

In the previous section we have seen that tags are in general very useful for search applications. Nevertheless, in some cases we could identify clear gaps between the tagging and querying vocabulary: Usage context for music and Opinion for both music and picture resources. Here, many more tags from these categories would be needed to support the very frequent queries targeting such characteristics of the content. In order to bridge these gaps in the tagging vocabulary, we propose solutions based on tags.

In this section we will thus focus on inferring additional information from the tags associated with music and picture resources and recommending it to the users during the annotation process. This way we support users in providing tags from the categories we need. More specifically, we will develop methods to identify the corresponding "moods" and "themes" for songs, as well as picture "moods". By "mood" we understand the state or quality of a particular feeling induced by listening to a particular song or seeing a particular photo (e.g. aggressive, happy, sad, funny, etc.). The "theme" of a song refers to the context or situation which best fits listening to it (e.g. at the beach, dinner ambiance, night driving, party time, etc.).

4.1. Datasets

To obtain the datasets for our experiments we used several data sources: Last.fm, AllMusic.com, www.lyricsdownload.com, www.lyricsmode.com and Flickr. In the following we present some relevant statistics for each of them.

AllMusic.com (AM). Established in 1995, the AllMusic.com website was created as a place and community for music fans. Not only can all genres be found on AllMusic.com, but also reviews of albums and artists within the context of their own genres, as well as classifications of songs and albums according to themes, moods or instruments. All these reviews and classifications are manually created by music experts from the AllMusic.com team; therefore the data found here serves as a good ground truth corpus. For our experiment we collected the AllMusic.com pages corresponding to music themes and moods; we could find 178 different moods and 73 themes. From the pages corresponding to moods/themes, we also gathered information about which music tracks fall into these categories, and we restricted the dataset to contain only tracks also present in our Last.fm crawl.

Last.fm (LFM). For the purpose of our investigations, we crawled an extensive subset of the Last.fm website, namely pages corresponding to tags, music tracks and user profiles. We started from the crawl described in Section 3.1 and recollected the information related to tags associated with music tracks. Of all the tracks obtained from AllMusic.com, we could also find 13,948 in the Last.fm dataset. For this intersection we had 81,964 different tags, and for each of these tags we extracted information regarding the number of times it has been used.

Lyrics (LY). To investigate whether another source of information, namely lyrics, as one part of music content, can provide added value in the task of mood and theme recommendation, we also obtained the corresponding lyrics for our tracks, where available. Here, we used a previous crawl (described in [6]) of the www.lyricsdownload.com site. Additionally, we crawled the www.lyricsmode.com website, such that we could gather the lyrics for a total of 6,915 tracks.

Flickr (F). For the purpose of deriving mood labels for pictures, we collected data from Flickr. We started by manually selecting Flickr groups that correspond to the emotion/mood labels we wanted to predict; more explicitly, we made use of the hierarchy of human emotions presented in Table 2.
We found corresponding Flickr groups for 17 out of the 25 secondary emotions, including the six primary emotion labels. For all pictures in the identified groups we downloaded via Flickr's API⁵ all associated metadata, in particular the user-assigned tags.

⁵ http://www.flickr.com/services/api/

Primary (Man. 1st)   Secondary Emotion (Man. 2nd)
Love                 Affection, Lust, Longing
Joy                  Cheerfulness, Zest, Pride, Optimism, Contentment, Enthrallment, Relief
Surprise             Surprise
Anger                Irritation, Exasperation, Rage, Disgust, Envy, Torment
Sadness              Suffering, Sadness, Disappointment, Shame, Neglect, Sympathy
Fear                 Horror, Nervousness
Table 2
Hierarchy of basic human emotions [34]

4.2. Deriving Music Moods and Themes

As we could see in Section 3, the majority of tags associated with music resources corresponds to genre information (around 60% of the tags). This is somewhat redundant information, as it can also be extracted from ID3 tags. Considerably less frequent are tags referring to moods (20%) or themes (5%), though when searching for music the majority of queries falls into exactly these categories: 30% of the queries are theme-related, 15% target mood information, and the rest is almost uniformly distributed among the six other categories.

A natural question that arises is therefore: how can we support users in providing these kinds of tags? Consider for example the ABBA song "Dancing Queen": by listening to the song or just considering the lyrics ("Friday night and the lights are low / looking out for the place to go / where they play the right music / getting in the swing ...") one immediately gets transposed into a weekend party atmosphere and an enjoyable state of mind. It would therefore be natural to describe and also search for this song with mood-related words such as "fun", "happy", etc. and with theme tags like "Party Time", "Thank God It's Friday!" or "Girls Night Out". Nevertheless, when inspecting the tags Last.fm users provided for this track, we cannot really identify these concepts. Instead, tags such as "pop", "disco", "70s" or "dance" are quite often employed. With the algorithms we describe in this section we can provide users with mood- and theme-related tags to choose from during the tagging process; for this we use the AM, LFM and LY datasets introduced in Section 4.1.

4.2.1. Music Mood and Theme Recommendation Algorithm

To recommend themes and moods, we base our solution on collaboratively created social knowledge, i.e. tags associated with music tracks, extracted from Last.fm, as well as on lyrics information. Based on already provided user tags, on the lyrics of music tracks, or on combinations of the two, we build classifiers which try to infer other annotations corresponding to the moods and themes of the songs. Our approach thus relies on the following hypotheses:
(i) Existing tags provided by users for a particular song carry information which can be used to infer the mood or theme of that song; e.g. songs tagged with "hard-rock" are more likely to have an "aggressive" mood than "mellow"-tagged songs.
(ii) The lyrics of the tracks give a hint on the mood or theme of the songs. For example, tracks with love-related lyrics have "romantic evening" as theme and, correspondingly, a "romantic" mood.
In order to recommend mood and theme annotations we thus build probabilistic classifiers trained on the AllMusic.com ground truth, using tags and/or lyrics as features. Separate classifiers correspond to the different types of classes that we aim to recommend, and to build the classifiers we use the open source machine learning library Weka⁶. In the experiments presented in this paper, we use the Naïve Bayes Multinomial implementation. Several other classifiers (e.g. Support Vector Machines, Decision Trees) have been tested, which resulted in similar classification performance but were much more computationally intensive. We have one classifier trained for the whole available set of classes (i.e. either for moods or for themes), and this classifier produces for every song in the test set a probability distribution over all classes (e.g. over all moods). Thus, one or more classes (based on probabilities or on a given rank number) can then be assigned to each song.

⁶ http://www.cs.waikato.ac.nz/~ml/weka
Based on the hypotheses enumerated above, we also experiment with three types of input features for the classifier: (1) tags; (2) words from lyrics; or (3) tags and lyrics. Depending on the type of features used to train the classifier and on the type of class that the classifier will assign to songs, we propose 6 experimental settings (2 types of output classes – moods and themes – and 3 types of features – tags, lyrics, tags+lyrics).

Algorithm 1 presents the main steps of our approach. We show the algorithm for mood recommendations based on tag features, the other algorithms being corresponding variants.

Alg. 1. Tag-based Mood recommendation
1: Apply clustering method on mood classes (optional)
2: Select classes of moods M to be learned
   2a: For each mood class
   2b: If the class does not contain at least 30 songs, discard the class
3: Split song set S_total into
   S_train = songs used for training the classifier
   S_test = songs used for testing recommendations
4: Select tag features for training the classifier
   4a: For each song s_i ∈ S_train
   4b: Create feature vector F(s_i) = {t_j | t_j ∈ T}, where
       T = set of tags from all songs in all classes
       t_j = log(freq(t_j) + 1) if s_i has tag t_j; 0 otherwise
5: Train Naïve Bayes classifier on S_train using {F(s_i); s_i ∈ S_train}
6: For each song s_i ∈ S_test
   6a: Compute probability distribution P(s_i) = {p(m_j | s_i); m_j ∈ M}
   6b: Select top k moods M_top-k from M based on p(m_j | s_i)
   6c: Recommend M_top-k to the user

Step 1 of the algorithm above aims at reducing the number of mood classes to be predicted for the songs, since the 178 AllMusic.com mood labels are hardly distinguishable for a non-expert. This step is optional, as we experiment with all classes of moods / themes from AllMusic.com, as well as with a subset resulting from applying a clustering method on the original set. In this paper, we present only the results for the best performing classifiers, i.e. themes clustered based on synonymy relationships (WordNet⁷) and moods clustered into primary and secondary basic human emotions [34]. The details for these clustering methods are provided at the end of this subsection.

As we need a certain amount of input data in order to be able to consistently train the classifiers, we discard those classes that have fewer than 30 songs assigned (step 2). After selecting separate sets of songs for training and testing in step 3 (e.g. for every fold in a 10-fold cross-validation), we build the feature vectors corresponding to each song in the training set (step 4). In the case of features based on tags, the vectors have as many elements as the total number of distinct tags assigned to the songs belonging to the mood classes. The elements of a vector take values depending on the frequency of the tags occurring along with the song. In computing the vector elements, we experimented with different variations and with automatic feature selection (e.g. Information Gain), but the formula based on the logarithm of the tag frequency provided the best results, and the full set of features was better suited for learning, even though it contained some noise. Once the feature vectors are constructed, they are fed into the classifier and used for training (step 5). A model is learned and afterwards applied to any new, unseen data. We can choose how many moods are recommended to the user based on the probabilities resulting from the classification or by setting an absolute threshold (steps 6a–c).
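Algorithm 1 relies on Weka's Naïve Bayes Multinomial implementation; purely as an illustration, the same pipeline can be sketched in Python with scikit-learn as a stand-in. The log(freq + 1) weighting of step 4 and the top-3 recommendation of step 6 follow the algorithm; the input structures (per-song tag–count dictionaries and mood labels) are assumptions made for the example.

import math
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

def tag_features(track_tags):
    # track_tags: one {tag: count} dict per song (step 4 of Alg. 1)
    return [{tag: math.log(count + 1) for tag, count in tags.items()}
            for tags in track_tags]

# train_tags, train_moods and test_tags are the assumed inputs described above.
vectorizer = DictVectorizer()                               # one dimension per distinct tag
X_train = vectorizer.fit_transform(tag_features(train_tags))
classifier = MultinomialNB().fit(X_train, train_moods)      # step 5

X_test = vectorizer.transform(tag_features(test_tags))
probabilities = classifier.predict_proba(X_test)            # step 6a: p(m_j | s_i)
top3 = probabilities.argsort(axis=1)[:, ::-1][:, :3]        # steps 6b/6c
recommended = classifier.classes_[top3]                     # three mood labels per song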
Clustering. The WordNet-based clustering of themes aims at grouping semantically related theme labels. On average, the 73 themes have 1.6 words (including stopwords; 1.55 when discarding the stopwords). For each of the 73 themes we first process the words the theme label consists of: all stopwords are removed, and for the remaining words we extract the corresponding WordNet synonyms. All resulting synonym sets are compared pairwise, and if the overlap between two sets is at least two words, the corresponding themes are clustered. With this procedure, the resulting set of themes contained 58 entries.
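A sketch of this grouping step, using NLTK's WordNet interface as one way to access the synonym sets (Python, illustrative; the stopword list and the union–find style merging are assumptions about details left open above):

from itertools import combinations
from nltk.corpus import stopwords, wordnet as wn   # requires the NLTK corpora

STOP = set(stopwords.words("english"))

def synonym_set(theme):
    """WordNet lemma names for the content words of a theme label."""
    words = [w for w in theme.lower().split() if w not in STOP]
    lemmas = set(words)
    for word in words:
        for synset in wn.synsets(word):
            lemmas.update(l.name().lower() for l in synset.lemmas())
    return lemmas

def cluster_themes(themes, min_overlap=2):
    """Merge themes whose synonym sets share at least min_overlap words."""
    parent = {t: t for t in themes}
    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t
    synonyms = {t: synonym_set(t) for t in themes}
    for a, b in combinations(themes, 2):
        if len(synonyms[a] & synonyms[b]) >= min_overlap:
            parent[find(b)] = find(a)
    clusters = {}
    for t in themes:
        clusters.setdefault(find(t), []).append(t)
    return list(clusters.values())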
For manually grouping the 178 AllMusic.com moods we made use of the extensive work already done on studying human emotions. Though there is little agreement on the exact number of basic emotions, let alone on a taxonomy including combinations of the basic concepts into complex, secondary emotions, we found the hierarchy reported in Shaver et al. [34] useful for our task (see Table 2). Moods are usually considered very similar to emotions, but longer in duration, less intensive and missing object directedness.

⁷ http://wordnet.princeton.edu
For categorizing the AllMusic.com moods we had to slightly adapt the taxonomy to fit our data: Surprise was removed since no example moods were found, and the same happened for some secondary emotions. Since some moods do not actually denote a mood (e.g. "literate"), we introduced a new class (Neutral) with three second level classes. In total, we obtained 23 second level classes ("Man. 2nd") falling into six first level classes ("Man. 1st"). We also adopted a procedure similar to the one used in [34] to build the aforementioned taxonomy of basic and secondary emotions. In a similarity sorting task, all AllMusic.com theme terms, written on cards, were sorted by the authors into as many and as large piles as seemed appropriate. There was no limit with respect to the number of clusters or their size. For each person, a co-occurrence matrix stated whether two themes were placed in one category (1) or not (0). The individual matrices were built and added up to find good groupings by analyzing the clusters. Unclear membership of singular labels was resolved after discussion.

4.2.2. Experiments and Results

To measure the quality of our algorithms, we evaluate our mood and theme tag predictions against the corresponding assignments in the AllMusic.com dataset. Being manually created by music experts, the assignments of songs to classes of moods and themes can be considered correct and are thus accepted as ground truth. Since our goal is the recommendation of relevant annotations, we perform a standard 10-fold cross-validation and evaluate our results using the following standard IR metrics: Hit rate at rank k (H@k), R-Precision (RP) and Mean Reciprocal Rank (MRR). We concentrate on the H@3 metric, as we recommend three annotations to the users to choose from. We consider three annotations a good compromise between providing enough suggestions and at the same time not overwhelming the users with too much information. We present the results for all our experimental runs in Table 3.
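The three metrics reduce to simple per-song computations; a minimal sketch (Python, illustrative), where ranked is the class ranking produced by the classifier for one song and relevant the set of ground-truth classes:

def hit_rate_at_k(ranked, relevant, k):
    """1 if at least one relevant class appears among the top-k recommendations."""
    return float(any(c in relevant for c in ranked[:k]))

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant classes."""
    r = len(relevant)
    return sum(c in relevant for c in ranked[:r]) / r if r else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant class, or 0 if none is retrieved."""
    for i, c in enumerate(ranked, start=1):
        if c in relevant:
            return 1.0 / i
    return 0.0

# Averaging these values over all test songs and all 10 folds gives the
# H@k, RP and MRR figures reported in Table 3.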
We observe that the best performing methods are those using tags as input features for the classifiers. The methods using only lyrics as features perform worst. When combining tags and lyrics as features, the corresponding methods perform much better than those based only on lyrics, and they sometimes also slightly outperform the tag-based methods. These results confirm once more the quality of user-provided tags, as well as hypothesis 1 on which our approach relies (see Section 4.2.1). Lyrics, in contrast to tags, introduce noise, as many song texts contain all sorts of interjections (e.g. "hey", "uh-huh", etc.), slang or simply informal English. While incorporating lyrics features helps to achieve good results for genre recommendations [7], they do not seem to be indicative of the mood of a song. For themes, there is a slight, yet rarely significant, effect. Though lyrics alone are obviously not descriptive enough to decide well upon the theme, by setting the topic lyrics may help remove some tag ambiguity. This relates to the second hypothesis on which we built our approach.

For theme recommendations, the best results, an H@3 of 0.88, are achieved by the algorithm using a combination of tags and lyrics as features and applying a WordNet synonymy based clustering on the theme classes. Compared to themes, mood recommendations do not perform as well when using many classes, achieving only an H@3 of 0.64. For the case of moods, we present the results corresponding to both first and second level manual clustering of the original AllMusic.com classes (rows "Man. 1st" and "Man. 2nd"). Reducing the number of clusters to the 6 first level classes ("Man. 1st"), corresponding roughly to basic human emotions, boosts the performance considerably, and for the best method, using tags and lyrics as input features, we achieve an H@3 value of 0.89. Of course, this task is now much easier, as can be seen from the likewise much better performance of the random classifier. Though having a larger mood vocabulary for recommendations should be aimed at, trade-offs are necessary. An interesting question for future work is how many classes are appropriate to describe the mood distinctions people actually make when listening or referring to music.

Micro-evaluating the results per specific annotation class moreover shows that while some classes are relatively easy to recommend, others may require special attention or some level of disambiguation. Table 4 shows H@3 values for the different classes without applying any clustering method and using tags as features. In general, those class labels that are harder to recommend appear more ambiguous, with the corresponding annotations being mostly subjective. Themes like "Late Night" or "Summertime" strongly depend on the person and what s/he usually does late at night or in the summer.⁸ The same is true for moods like "Precious" or "Rambunctious", as they can be subjectively interpreted in several ways. On the other hand, classes which can be recommended with high accuracy are also more clearly defined, be it a theme like "Slow Dance" or a mood like "Hypnotic". Interestingly, for neither moods nor themes did we find a correlation between the a priori probability of a class, i.e. its size in terms of positive examples in the dataset, and performance.

⁸ In our Facebook evaluation study [7] those themes were used by many distinct people; thus, the bad performance cannot be explained by only a few people making idiosyncratic use of a term. Unfortunately, for the AllMusic.com mood annotations no user/expert frequencies are available.

12
Clustering   Classes  Features      H@3     H@5     RP      MRR
Themes
–            11       Random        0.29    0.47    0.10    0.28
–            11       Tags          0.80    0.92    0.49    0.67
–            11       Lyrics        0.56∗   0.72∗   0.26∗   0.46∗
–            11       Tags+Lyrics   0.80∗+  0.94+   0.48∗+  0.67∗+
WordNet      9        Random        0.36    0.58    0.12    0.33
WordNet      9        Tags          0.85    0.94    0.47    0.66
WordNet      9        Lyrics        0.72∗   0.85∗   0.38∗   0.59∗
WordNet      9        Tags+Lyrics   0.88+   0.96+   0.48+   0.69+
Moods
–            89       Random        0.06    0.09    0.02    0.08
–            89       Tags          0.39    0.51    0.17    0.34
–            89       Lyrics        0.17∗   0.25∗   0.06∗   0.17∗
–            89       Tags+Lyrics   0.37+   0.48+   0.15+   0.32+
Man. 1st     6        Random        0.61    0.89    0.23    0.47
Man. 1st     6        Tags          0.88    0.99    0.49    0.71
Man. 1st     6        Lyrics        0.82∗   0.98    0.42∗   0.65∗
Man. 1st     6        Tags+Lyrics   0.89∗+  0.99    0.52+   0.73+
Man. 2nd     22       Random        0.21    0.33    0.07    0.22
Man. 2nd     22       Tags          0.63    0.76    0.31    0.53
Man. 2nd     22       Lyrics        0.49∗   0.65∗   0.21∗   0.41∗
Man. 2nd     22       Tags+Lyrics   0.64+   0.78+   0.31+   0.52+
Table 3
Experimental results: H@3, H@5, RP, MRR for the different algorithms, along with a random baseline for comparison. A ∗ or a + marks a statistically significant difference (one-tail paired t-test with p < 0.05) with respect to tags or lyrics as features, respectively (per clustering method).
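The significance markers in Table 3 come from one-tail paired t-tests over per-fold scores; a small sketch of such a test (Python with SciPy, illustrative — SciPy's ttest_rel is two-sided, so the p-value is halved for the directional test):

from scipy.stats import ttest_rel

def one_tail_paired_ttest(scores_a, scores_b):
    """Test whether method A beats method B on paired per-fold scores."""
    t, p_two_sided = ttest_rel(scores_a, scores_b)
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
    return t, p_one_sided

# e.g. comparing the ten per-fold H@3 values of Tags+Lyrics against those of
# Lyrics (p < 0.05) decides whether a "+" marker is warranted.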

It is difficult to directly compare our results to the related work cited in Section 2, as each paper uses a different number of classes. Moreover, experimental goals, ground truth and evaluation procedures vary, or detailed descriptions are missing, e.g. whether strict classification into one class is used or many classes are proposed for one piece of music. Instead, we also evaluated the quality of our recommended themes for music tracks in terms of user judgements. Thus, we set up a user survey as a Facebook application⁹ (see Figure 3), where users had to manually label songs with one or more of the theme classes used in our algorithms and in AllMusic.com. With this user survey, we aimed to compare not only the performance of normal users against the AllMusic.com experts, but also the results of our algorithm against the choices of the users.

⁹ For details on the survey and application, please refer to [7] or access http://www.facebook.com/apps/application.php?id=20699508679

The results presented in [7] show that our method performs well also with respect to the user assignments. The fact that the users perform quite badly compared to the AllMusic.com experts, while our method performs well both compared to the users and to the experts, indicates that our method provides theme labels that are easier for users to recognize than the labels assigned by AllMusic.com experts, and thus helps bridge the gap between the users' and the music experts' vocabularies.
         Best               #Docs   H@3    Worst           #Docs   H@3

Themes   Slow Dance         40      0.97   Late Night      26      0.52
         Romantic Evening   27      0.89   Summertime      43      0.62
         Autumn             36      0.89   Party Time      29      0.72

Moods    Ethereal           40      0.65   Precious        33      0.00
         Hypnotic           47      0.64   Calm/Peaceful   33      0.00
         Angst-Ridden       61      0.57   Rambunctious    30      0.00

Table 4. Examples of best and worst performing (by H@3) classes, without clustering, learned using tags as features. #Docs gives the number of music tracks used in the experiments per mood/theme.

Fig. 3. Mood Mates! Facebook application

4.3. Deriving Moods for Pictures

Similar to the case of music, our analysis for pictures showed a clear gap between the tagging and the querying vocabulary. Here, a large portion of tags refer to location information, such as the country or city where the picture has been taken. However, queries targeting images much more often name subjective aspects of the objects or persons depicted in the photos, e.g. "scary", "rage" or "funny". In this section we present an approach which aims at bridging exactly this gap.

4.3.1. Picture Mood Recommendation Algorithm

Recommendations of mood annotations for pictures rely only on tag information. Unlike for music, where we could also exploit the lyrics of the songs, for Flickr pictures the only available textual information comes from the tag data 10 . The assumption on which we base our recommendation is similar to the case of songs, namely that the existing tags attached to photos can provide information regarding the corresponding mood of the pictures.

10 Other types of textual metadata, like titles, descriptions, comments, group memberships, etc., could have been used, but we wanted to keep this approach generalizable to other photo sharing systems as well.

Given the crawling methodology used for the pictures, and in order to ensure a fair classification of the data, all tags related to a mood or emotion were deleted. To this end, we looked up in WordNet all the labels included in our emotion taxonomy and collected all corresponding word forms of the most popular synset, thus potentially including synonyms, as well as all their derivationally related forms, such as adjectives. For example, for "Anger" the synset contains the word forms "anger", "choler" and "ire", as well as the derivational forms "angry", "to anger" and "choleric". The resulting list of terms was used to remove all matching tags of the collected Flickr pictures (dataset F, described in Section 4.1). This approach is inspired by work on the evaluation of personalization methods, where parts of the users' preferences are removed in order to be later inferred by the proposed personalization algorithms (see [12,10]).
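A minimal sketch of how such a removal list can be assembled; we use NLTK's WordNet interface here purely for illustration (the toolkit is not specified above), while restricting the lookup to the most popular, i.e. first-listed, synset as described:

    from nltk.corpus import wordnet as wn

    def removal_terms(mood_label):
        """Word forms of the most popular synset for a mood label, plus their
        derivationally related forms (e.g. 'anger' -> 'angry')."""
        synsets = wn.synsets(mood_label)
        if not synsets:
            return {mood_label.lower()}
        terms = set()
        for lemma in synsets[0].lemmas():  # most popular synset only
            terms.add(lemma.name().replace("_", " ").lower())
            for related in lemma.derivationally_related_forms():
                terms.add(related.name().replace("_", " ").lower())
        return terms

    # e.g. yields terms such as 'anger', 'choler', 'ire', 'angry', ...
    anger_terms = removal_terms("anger")

    # Tags matching any collected term are dropped before training.
    tags = ["anger", "portrait", "ire", "street", "berlin"]
    clean_tags = [t for t in tags if t.lower() not in anger_terms]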
The remaining tags are used as input features for training a multi-class classifier over all classes of moods. As in the case of music, we make use of the Weka implementation of the Naive Bayes Multinomial classifier, which produces for all pictures in the test set probability distributions over all classes of moods.
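The experiments above rely on Weka's NaiveBayesMultinomial; the following sketch shows an analogous setup with scikit-learn instead (the toy tag lists and mood labels are invented for illustration, only the bag-of-tags features and the multinomial Naive Bayes model follow the description):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # One training example per picture: its remaining tags, joined by spaces.
    train_tags = [
        "grave cemetery rain alone",     # picture from a "Sadness" group
        "party friends smile sunshine",  # "Joy"
        "dark abandoned shadow night",   # "Fear"
    ]
    train_moods = ["Sadness", "Joy", "Fear"]

    # Bag-of-tags representation: every distinct tag becomes one count feature.
    vectorizer = CountVectorizer(token_pattern=r"[^ ]+")
    X_train = vectorizer.fit_transform(train_tags)

    model = MultinomialNB()
    model.fit(X_train, train_moods)

    # For an unseen picture the classifier yields a probability distribution
    # over all mood classes, from which the top-k moods can be recommended.
    X_test = vectorizer.transform(["rain alone night"])
    for mood, prob in zip(model.classes_, model.predict_proba(X_test)[0]):
        print(f"{mood}: {prob:.2f}")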

4.3.2. Experiments and Results

Like for music, we also aim at evaluating the quality of the recommended mood labels for pictures. As ground truth data we use the data collected from Flickr (set F), since all these pictures have been manually assigned by users to Flickr groups centered around human emotions/moods. All pictures pertaining to a specific mood class represent the positive training examples, while pictures taken randomly from the rest of the classes build up the set of negative examples. In all cases, the number of positive and negative examples for a class is equally balanced.
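A sketch of this balancing step; the grouping of pictures by mood and the helper name are ours, while the random, equally sized negative sample follows the description above:

    import random

    def balanced_training_set(pictures_by_mood, target_mood, seed=42):
        """Positives: all pictures of the target mood group.
        Negatives: equally many pictures drawn at random from the other groups."""
        positives = list(pictures_by_mood[target_mood])
        others = [pic for mood, pics in pictures_by_mood.items()
                  if mood != target_mood for pic in pics]
        random.Random(seed).shuffle(others)
        negatives = others[:len(positives)]
        return positives, negatives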
A first set of experiments aimed at recommending mood labels corresponding to the primary human emotions; in this case, the classes to be learned by the classifiers consisted of the union of all data belonging to the underlying secondary emotions (e.g. the "Love" class comprises all data gathered from the Flickr groups for "Affection", "Lust" and "Longing"). Similarly, another experimental run focused on secondary emotion label recommendations; in this case each secondary emotion represented a class to be learned. We performed 10-fold cross validation and evaluated the performance of our method according to the same set of IR metrics that were also used for music: Hit rate at rank k (H@k), R-Precision (RP) and Mean Reciprocal Rank (MRR). All results are summarized in Table 5.

                     Mood           #Docs    H@1    H@3    H@5    RP     MRR

Primary Emotions     [Random]       -        0.17   0.50   0.83   0.17   0.41
                     [Overall]      52,426   0.89   0.97   0.99   0.89   0.93
                     Fear           7,248    0.87   0.99   1      0.87   0.93
                     Sadness        40,602   0.91   0.96   0.99   0.91   0.94
                     Joy            1,062    0.77   0.95   1      0.77   0.86
                     Love           1,184    0.71   0.94   0.99   0.7    0.83
                     Anger          1,695    0.84   0.94   0.99   0.84   0.89
                     Surprise       635      0.61   0.93   0.98   0.61   0.77

Secondary Emotions   [Random]       -        0.06   0.18   0.29   0.06   0.20
                     [Overall]      52,452   0.89   0.97   0.98   0.89   0.93
                     Horror         6,881    0.84   0.98   0.99   0.84   0.91
                     Neglect        36,943   0.94   0.98   0.99   0.94   0.96
                     Sadness        3,684    0.73   0.98   0.99   0.73   0.85
                     Nervousness    367      0.95   0.96   0.98   0.95   0.96
                     Torment        823      0.88   0.96   0.97   0.88   0.92
                     Rage           680      0.92   0.95   0.99   0.92   0.94
                     Cheerfulness   443      0.88   0.93   0.95   0.88   0.92
                     Surprise       635      0.59   0.91   0.97   0.59   0.75
                     Longing        1,011    0.7    0.89   0.98   0.69   0.81
                     Relief         78       0.54   0.74   0.87   0.54   0.67
                     Disgust        98       0.53   0.63   0.71   0.53   0.63
                     Pride          112      0.44   0.63   0.79   0.44   0.58
                     Optimism       308      0.46   0.61   0.81   0.46   0.6
                     Affection      124      0.26   0.52   0.73   0.26   0.46
                     Zest           122      0.18   0.39   0.58   0.18   0.37
                     Irritation     94       0.17   0.28   0.35   0.17   0.31
                     Lust           49       0.06   0.18   0.47   0.06   0.25

Table 5. Experimental results: H@1, H@3, H@5, RP, MRR for the different algorithms over all picture moods, i.e. primary and secondary emotions; #Docs gives the number of pictures per emotion. [Overall] shows the average weighted by the number of instances in the mood class; [Random] is the random baseline for comparison.
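Since in this evaluation each test picture has exactly one correct mood, the three metrics reduce to simple functions of the rank at which that mood appears in the probability-sorted recommendation list; a small sketch (the helper names are ours):

    def hit_at_k(rank, k):
        """1 if the correct mood appears among the top-k recommendations."""
        return 1.0 if rank <= k else 0.0

    def reciprocal_rank(rank):
        return 1.0 / rank

    # With a single relevant class per item, R-Precision equals the hit rate at rank 1.
    def r_precision(rank):
        return hit_at_k(rank, 1)

    # Example: ranks of the correct mood for three test pictures.
    ranks = [1, 2, 4]
    n = len(ranks)
    print("H@3 =", sum(hit_at_k(r, 3) for r in ranks) / n)
    print("RP  =", sum(r_precision(r) for r in ranks) / n)
    print("MRR =", sum(reciprocal_rank(r) for r in ranks) / n)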
The results confirm once more the hypothesis on which we based our recommendation approach: existing tags can give good indications regarding the corresponding moods of the pictures. All recommendations corresponding to the primary human emotions achieved very high quality, with values close to 1 for H@3 and even an H@1 between 0.71 and 0.91. We also compute the overall performance over all primary emotion classes, as averages weighted by the number of instances corresponding to each class. The results are very good, with a value of 0.97 for H@3 and 0.93 for MRR.

Only for "Surprise" were the results somewhat poorer, with an H@1 of 0.61 and an MRR of 0.77. This may be due to the fact that the only Flickr group we could select for this mood was less focused (named "Shocking, Surprise and General Wide Eyes!!"). Looking in detail at the confusion matrix (Figure 4A), this becomes visible: "Surprise" is often misclassified as "Fear". A clear distinction seems to be difficult for our Flickr users. The same was also indicated by psychological studies, which reported that fear and surprise expressions are easily differentiated from other basic emotions, but are often confused with each other, both in labeling and in posing facial expressions [17].
A) Primary emotions (rows: correct class, columns: predicted class)

                Anger   Sadness   Love   Joy   Fear   Surprise
    Anger        1438        96     21    32     96         12
    Sadness       316     37004    630   334   1878        440
    Love           10       100    882    36    101         55
    Joy             8       102     25   817     66         44
    Fear           58       643     63    55   6337         92
    Surprise        5        21     11    41    172        385

Fig. 4. Confusion matrices for A) primary and B) secondary emotions as image moods

For primary emotions, the correlation between class size and performance is medium: Pearson's r is 0.45 for H@3 and 0.63 for H@1, RP and MRR. Thus, when misclassifying instances, the classifier is biased towards incorrectly assigning one of the two dominant classes, "Fear" or "Sadness". Besides, these two emotions are very close together in the cluster analysis of the mood label space of [34] 11 . Both share the same negative valence, but with different intensity: fear may range from low intensity (i.e. being worried) to very high intensity (i.e. being panicked). Investigating the tag features used in learning these classes, similar tags are ranked highly according to their information gain, though all tags have rather small values in general.

11 Participants had to sort emotion labels into piles of related emotions, thus establishing a hierarchy/clusters of basic and secondary emotions.

The overall weighted results for the secondary human emotion label recommendations are almost identical to those for primary emotions. If the averages are not weighted by class prevalence, the overall unweighted averages for secondary emotions are about 0.2 lower compared to their counterparts for primary emotions. This is due to the weighting process favoring the overly frequent and well predicted classes "Neglect" and "Sadness" (both corresponding to the primary emotion "Sadness") and "Horror" (with "Fear" as primary emotion). As "Sadness" and "Fear" examples are also highly prominent in our experiment for primary emotions, the overall results reported in Table 5 are similar.
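The gap between the two ways of averaging can be illustrated directly with a handful of the per-class values from Table 5 (the subset choice and the helper code are ours; only the full computation over all secondary classes yields the roughly 0.2 difference mentioned above):

    # (#pictures, H@3) for a subset of the secondary emotions in Table 5
    classes = {
        "Neglect":   (36943, 0.98),
        "Horror":    (6881, 0.98),
        "Sadness":   (3684, 0.98),
        "Affection": (124, 0.52),
        "Zest":      (122, 0.39),
        "Lust":      (49, 0.18),
    }

    total = sum(n for n, _ in classes.values())
    weighted = sum(n * h3 for n, h3 in classes.values()) / total
    unweighted = sum(h3 for _, h3 in classes.values()) / len(classes)

    # The weighted average is dominated by the huge, well-predicted "Neglect"
    # and "Horror" groups, while the unweighted average treats all classes equally.
    print(f"weighted H@3:   {weighted:.2f}")    # close to 0.98
    print(f"unweighted H@3: {unweighted:.2f}")  # noticeably lower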
In general, the correlation between the a priori probability of a class and performance is smaller for secondary emotions: Pearson's r is between 0.32 and 0.37 for the different evaluation measures. Thus, the larger classes "Neglect", "Horror" and "Sadness" are predicted wrongly more often than the remaining ones. Still, the confusion matrix in Figure 4B indicates some interesting patterns not easily explainable by classifier bias. "Longing" is very often misclassified as "Sadness", much more often than as the very frequent "Neglect" or "Horror". Although they belong to different primary emotions ("Love" vs. "Sadness") and are rather far apart in the cluster analysis of emotion labels, for Flickr users they seem to share the negative valence and low arousal. Again, "Surprise" gets easily confused with the fear-related emotion of "Horror".

When inspecting the results over the different mood classes, we can see that for some classes, e.g. "Affection", "Zest", "Irritation" and "Lust", performance is considerably lower, with H@3 ranging from 0.18 to 0.52. The main reason for these results is the relatively small number of pictures contained in each of these groups, which made learning more difficult.
Moreover, manually inspecting the corresponding group of Flickr photos for all those four classes, we found it difficult to identify pictures depicting only the intended state of mind for each particular Flickr group. For example, we could observe a large number of "Affection" pictures depicting sad/crying people and, implicitly, a large set of tags close to the tags belonging to the "Sadness" class. Given the relatively small set of pictures, the influence of such 'ambiguous' photos and, implicitly, of their associated tags becomes critical.

For all other mood classes we achieve H@3 values over 0.6, for about half of the classes even 0.90 and more, the best results being obtained for the class "Neglect" – 0.98 for H@3 and 0.96 for MRR.

Having used the same measures as in the case of the music mood and theme recommendations, we can directly compare the two sets of results. In Figure 5 we depict the H@3 and MRR values for all best performing theme and mood recommendation algorithms for music and pictures.

          Music     Music Moods   Music Moods   Image Moods   Image Moods
          Themes    Primary       Secondary     Primary       Secondary
    H@3   0.88      0.89          0.64          0.97          0.97
    MRR   0.69      0.73          0.52          0.93          0.93

Fig. 5. H@3 and MRR values across our best music and image mood and theme recommendations

For music, both theme and primary mood label recommendations achieve almost equal H@3 values of 0.88. Recommendations from the secondary mood classes are more error prone, achieving only 0.64 H@3. For the case of pictures, we do not observe any difference between primary and secondary mood recommendations. Moreover, recommendations for picture resources are of higher quality, probably due to the data which was used as ground truth: mood-related Flickr groups, manually created by users. The ground truth gathered from AllMusic.com, given the extremely high number of mood classes and the implicit redundancy, had to be mapped to the hierarchy of human emotions. This process potentially introduces some noise into the data.

5. Conclusions and Future Work

Collaborative tagging has become an increasingly popular means for sharing and organizing resources, leading to a huge amount of user-generated metadata, which can potentially provide interesting information to improve search. To tap this potential, we extended previous preliminary work with a thorough analysis of the use of tags for different collections and in different environments. We analyzed three very popular tagging systems, Del.icio.us, Flickr and Last.fm, and investigated the types of tags users employ, their distributions inside the general tag classification scheme that we proposed, as well as their suitability for improving search. Our analysis provided evidence for the usefulness of a common tag taxonomy for different collections and has shown that the distributions of tag types strongly depend on the resources they annotate. Moreover, we have shown that most of the tags can be used for search and that in most cases tagging behavior exhibits approximately the same characteristics as searching behavior. We also observed some noteworthy differences: for the music domain, Usage context/Theme is very useful for search, yet underrepresented in the tagging material. Similarly, for pictures and music, Opinion/Qualities/Mood queries occur quite often, although people tend to neglect this category for tagging.

Building on these results, we proposed a number of algorithms which aim at bridging exactly these gaps between the tagging and querying vocabularies, by automatically recommending mood and theme annotations. We trained multi-class classifiers on input features consisting of either only the existing tags of the picture and music resources, or of tags and lyrics information in the case of music songs. The results of our evaluations show that providing such automatic tag recommendations is feasible and that we can achieve very good results both when comparing our algorithms with user judgements and with expert-created ground truth. Compared to some of our previous papers, where we introduced algorithms trying to identify music themes based on existing user annotations, here we address this aspect in more detail and discuss its applicability for identifying additional knowledge also for other types of multimedia resources. The experiments show that even if the resources and the associated metadata differ significantly, we can still achieve very good results and, besides, these
approaches have the potential to bridge the existing gaps in the users' tagging and querying vocabularies.

In general, the results indicate that for music it is easier to predict the corresponding themes of the songs than the moods. Comparable results for the two types of recommendations were achieved when mapping the AllMusic.com moods to the primary human emotions. On the other hand, mappings into the secondary human emotions are more difficult and thus susceptible to introducing noise. Recommendations of moods for picture resources are overall of higher quality than for music, due to the much more consistent set of tags attached to the photos and used as input features. Apart from some subjective mood classes, known to be difficult to distinguish, our tag recommendations are of high quality; given the self-reinforcing nature of user-generated tags, suggesting opinion and usage related concepts to users results in a related tag vocabulary, which will eventually converge to a more diverse set of tags.

For the future, we plan to further improve these algorithms, and in particular the feature selection mechanisms, by automatically identifying the tag types (e.g. Topic, Author, Location, etc.) and using them as input features for the classification. Other ideas worth investigating refer to the identification of further types of information for multimedia resources, such as events, persons or locations, as well as other types of entities frequently queried for by users. Moreover, we plan to use co-training, as in the approach described by Blum and Mitchell [9], to alleviate the problem of the limited (labeled) music ground truth. Last but not least, we would like to perform another type of evaluation, where the value of the inferred annotations can be measured directly, by comparing the results obtained by a search engine with and without an enriched multimedia dataset.

6. Acknowledgments

We are greatly thankful to our partners at the University of Koblenz/Landau and the Tagora project for providing the Flickr dataset, and to the Knowledge and Data Engineering Group/Bibsonomy [http://www.bibsonomy.org/] at the University of Kassel for providing the Del.icio.us dataset. This work was partially supported by the PHAROS project funded by the European Commission under the 6th Framework Programme (Contract No. 045035), and the GLOCAL project funded by the European Commission under the 7th Framework Programme (Contract No. 248984).

References

[1] R. Abbasi, S. Chernov, W. Nejdl, R. Paiu, S. Staab, Exploiting flickr tags and groups for finding landmark photos, in: Proceedings of the 31st European Conference on IR Research (ECIR 2009), vol. 5478 of Lecture Notes in Computer Science, Springer, 2009.
[2] M. Ames, M. Naaman, Why we tag: motivations for annotation in mobile and online media, in: Proceedings of the ACM SIGCHI Conference on Human Factors in Computing (CHI 2007), ACM, 2007.
[3] M. Aurnhammer, P. Hanappe, L. Steels, Integrating collaborative tagging and emergent semantics for image retrieval, in: WWW Collaborative Web Tagging Workshop, 2006.
[4] S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu, Optimizing web search using social annotations, in: Proceedings of the 16th International World Wide Web Conference (WWW2007), ACM, 2007.
[5] K. Bischoff, C. S. Firan, C. Kadar, W. Nejdl, R. Paiu, Automatically identifying tag types, in: Proceedings of the 5th International Conference on Advanced Data Mining and Applications (ADMA 2009), vol. 5678 of Lecture Notes in Computer Science, Springer, 2009.
[6] K. Bischoff, C. S. Firan, W. Nejdl, R. Paiu, Can all tags be used for search?, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08), ACM, 2008.
[7] K. Bischoff, C. S. Firan, W. Nejdl, R. Paiu, How do you feel about "dancing queen"?: deriving mood & theme annotations from user tags, in: Proceedings of the 2009 Joint International Conference on Digital Libraries (JCDL 2009), ACM, 2009.
[8] K. Bischoff, C. S. Firan, R. Paiu, Deriving music theme annotations from user tags, in: Proceedings of the 18th World Wide Web Conference (WWW2009), ACM, 2009.
[9] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, Morgan Kaufmann Publishers, 1998.
[10] D. Carmel, N. Zwerdling, I. Guy, S. Ofek-Koifman, N. Har'el, I. Ronen, E. Uziel, S. Yogev, S. Chernov, Personalized social search based on the user's social network, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), ACM, 2009.
[11] L. Chen, P. Wright, W. Nejdl, Improving music genre classification using collaborative tagging data, in: Proceedings of the 2nd ACM International Conference on Web Search and Data Mining (WSDM 2009), ACM, 2009.
[12] M. Claypool, P. Le, M. Waseda, D. Brown, Implicit interest indicators, in: Proceedings of the ACM Intelligent User Interfaces Conference (IUI), ACM, 2000.
[13] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1) (1960) 37–46.
[14] M. Dubinko, R. Kumar, J. Magnani, J. Novak, P. Raghavan, A. Tomkins, Visualizing tags over time, in: Proceedings of the 15th International World Wide Web Conference (WWW2006), ACM, 2006.
[15] P. Dunker, S. Nowak, A. Begau, C. Lanz, Content-based mood classification for photos and music: a generic multi-modal classification framework and evaluation approach, in: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval (MIR '08), ACM, 2008.
[16] D. Eck, P. Lamere, T. Bertin-Mahieux, S. Green, Automatic generation of social tags for music recommendation, in: Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems (NIPS), MIT Press, 2007.
[17] P. Ekman, H. Oster, Facial expressions of emotion, Annu. Rev. Psychol. 30 (1979) 527–554.
[18] B. Fasel, J. Luettin, Automatic facial expression analysis: a survey, Pattern Recognition 36 (1) (2003) 259–275.
[19] Y. Feng, Y. Zhuang, Y. Pan, Popular music retrieval by detecting mood, in: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2003), ACM, 2003.
[20] C. S. Firan, W. Nejdl, R. Paiu, The benefit of using tag-based profiles, in: Proceedings of the Fifth Latin American Web Congress (LA-Web 2007), IEEE Computer Society, 2007.
[21] S. A. Golder, B. A. Huberman, Usage patterns of collaborative tagging systems, Journal of Information Science 32 (2) (2006) 198–208.
[22] H. Halpin, V. Robu, H. Shepherd, The complex dynamics of collaborative tagging, in: Proceedings of the 16th International World Wide Web Conference (WWW2007), ACM, 2007.
[23] P. Heymann, G. Koutrika, H. Garcia-Molina, Can social bookmarking improve web search?, in: Proceedings of the 1st ACM International Conference on Web Search and Data Mining (WSDM 2008), ACM, 2008.
[24] A. Hotho, R. Jäschke, C. Schmitz, G. Stumme, Information retrieval in folksonomies: Search and ranking, in: The Semantic Web: Research and Applications, Proceedings of the 3rd European Semantic Web Conference (ESWC 2006), vol. 4011 of Lecture Notes in Computer Science, Springer, 2006.
[25] P. Knees, T. Pohle, M. Schedl, G. Widmer, A music search engine built upon audio-based and web-based similarity measures, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007), ACM, 2007.
[26] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics 33 (1) (1977) 159–174.
[27] M. Levy, M. Sandler, A semantic space for music derived from social tags, in: Proceedings of the 8th International Society for Music Information Retrieval Conference (ISMIR 2007), 2007.
[28] D. Liu, L. Lu, H.-J. Zhang, Automatic mood detection from acoustic music data, in: Proceedings of the 4th International Society for Music Information Retrieval Conference (ISMIR 2003), 2003.
[29] C. Marlow, M. Naaman, D. Boyd, M. Davis, HT06, tagging paper, taxonomy, flickr, academic article, to read, in: Proceedings of the 17th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 2006), ACM, 2006.
[30] M. Naaman, R. Nair, ZoneTag's collaborative tag suggestions: What is this person doing in my phone?, IEEE MultiMedia 15 (3) (2008) 34–40.
[31] G. Pass, A. Chowdhury, C. Torgeson, A picture of search, in: Proceedings of the 1st International Conference on Scalable Information Systems (Infoscale 2006), ACM, 2006.
[32] T. Rattenbury, N. Good, M. Naaman, Towards automatic extraction of event and place semantics from flickr tags, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007), ACM, 2007.
[33] S. Sen, S. K. Lam, A. M. Rashid, D. Cosley, D. Frankowski, J. Osterhouse, F. M. Harper, J. Riedl, tagging, communities, vocabulary, evolution, in: Proceedings of the 2006 ACM Conference on Computer Supported Cooperative Work (CSCW 2006), ACM, 2006.
[34] P. Shaver, J. Schwartz, D. Kirson, C. O'Connor, Emotion knowledge: Further exploration of a prototype approach, Journal of Personality and Social Psychology 52 (6) (1987) 1061–1086.
[35] B. Sigurbjörnsson, R. van Zwol, Flickr tag recommendation based on collective knowledge, in: Proceedings of the 17th World Wide Web Conference (WWW2008), ACM, 2008.
[36] S. Sood, S. Owsley, K. Hammond, L. Birnbaum, TagAssist: Automatic tag suggestion for blog posts, in: Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2007), 2007.
[37] R. E. Thayer, The biopsychology of mood and arousal, Oxford University Press, 1989.
[38] Z. Xu, Y. Fu, J. Mao, D. Su, Towards the semantic web: Collaborative tag suggestions, in: Workshop on Collaborative Web Tagging, held at the 15th International World Wide Web Conference, 2006.
[39] A. Zollers, Emerging motivations for tagging: Expression, performance, and activism, in: Workshop on Tagging and Metadata for Social Information Organization, held at the 16th International World Wide Web Conference, 2007.
