The Unusual Suspects: Deep Learning Based Mining of Interesting Entity Trivia
from Knowledge Graphs
Abstract

Trivia is any fact about an entity which is interesting due to its unusualness, uniqueness or unexpectedness. Trivia can be employed to promote user engagement in various product experiences featuring the given entity. A Knowledge Graph (KG) is a semantic network which encodes various facts about entities and their relationships. In this paper, we propose a novel approach called DBpedia Trivia Miner (DTM) to automatically mine trivia for entities of a given domain in KGs. The essence of DTM lies in learning an Interestingness Model (IM), for a given domain, from human-annotated training data provided in the form of interesting facts from the KG. The IM thus learnt is applied to extract trivia for other entities of the same domain in the KG. We propose two different approaches for learning the IM: a) a Convolutional Neural Network (CNN) based approach and b) a Fusion Based CNN (F-CNN) approach which combines both hand-crafted and CNN features. Experiments across two different domains, Bollywood Actors and Music Artists, reveal that the CNN automatically learns features which are relevant to the task and shows competitive performance relative to hand-crafted feature based baselines, whereas the F-CNN significantly improves the performance over the baseline approaches which use hand-crafted features alone. Overall, DTM achieves an F1 score of 0.81 and 0.65 in the Bollywood Actors and Music Artists domains respectively.

1 Introduction

In recent years, significant progress has been made in curating large web-scale semantic networks, popularly known as Knowledge Graphs (KGs), which encode various facts about entities and their relationships (Berners-Lee et al. 2001; Bizer, Heath, and Berners-Lee 2009; Auer et al. 2007). Some of the popular freely available knowledge graphs are DBpedia (Auer et al. 2007), YAGO2 (Suchanek, Kasneci, and Weikum 2007) and Wikidata (Vrandečić and Krötzsch 2014). For example, the English version of DBpedia alone contains more than 4.22 million entities and close to 3 billion relations (RDF triples). Besides, large commercial companies like Google, Microsoft, Yahoo! and Baidu have been curating their own KGs. Although a significant portion of these facts were extracted from unstructured sources like Wikipedia, KGs also include data from other rich sources such as crowd-sourcing, the CIA World Factbook, etc.

Trivia is any fact about an entity which is interesting due to its unusualness, unexpectedness, uniqueness or weirdness when compared to other entities of similar type (Prakash et al. 2015). For example, Yuvraj Singh, a popular cricket player from the Indian cricket team, authored a book called 'Test of my Life', which is an unexpected event for a cricketer since very few of them have authored books.

Trivia, when presented in the form of either a question or a factoid, helps in attracting human attention since it appeals to our innate sense of curiosity, inquisitiveness and appreciation of novelty (Attfield et al. 2011; O'Brien and Toms 2010). Due to this, studies have shown that trivia could be used to improve user engagement in various product experiences featuring the entity (Wizard 2013; Prakash et al. 2015). For example, during the ICC Cricket World Cup 2015, the Bing search engine offered a unique experience1 for cricket-related queries which included trivia about teams and players. As a result, they observed significant improvement in user engagement, with each user watching at least 10 trivia per cricket query. In a pure sense, the notion of human interestingness is a psychological, cognitive and subjective process (Varela, Thompson, and Rosch 1993). However, there are usually some facts for which there would be significant agreement, among a majority of people, regarding their interestingness. In this work, we restrict ourselves to such a majoritarian notion of interestingness and leave the personalized subjective angle for future work.

In spite of the utility of trivia, the curation process remains mostly manual, which is hard to scale across millions of entities on the web. Recently, some attempts (Prakash et al. 2015; Michael Gamon 2014) have been made to automatically mine interesting facts or spans of text from unstructured Wikipedia or normal text. However, such techniques rely heavily on the textual representation of facts or context features to learn about interestingness, such as the presence of superlatives (greatest, longest, best), contradictory words (although, however), exclusive words (is the only), verbs within the sentence indicating the core activity (performing a stunt), etc. On the other hand, in KGs, since the knowledge is represented in the form of a relation tuple (subject, relation, object), the above features will not be available. Due to this, the interestingness of a relation needs to be inferred based on other features such as the unusualness of a given relation with respect to other entities in the same domain.

Recently, deep learning based techniques have shown significant performance gains across various human intelligence tasks (Krizhevsky, Sutskever, and Hinton 2012;

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
1 http://www.windowscentral.com/bings-cricket-world-cup-coverage-includes-predictions-polls-and-more
Figure 2: Fusion Based CNN (F-CNN) for learning the Interestingness Model (IM)
which filters the interesting facts as trivia. For learning the IM, we model the notion of interestingness as a binary classification problem: interesting vs. boring. We show two different ways of learning the IM: a) a Convolutional Neural Network (CNN) and b) a Fusion Based CNN (F-CNN) (Suggu et al. 2016a; 2016b) which combines both hand-crafted features and CNN features.

3.1 Convolutional Neural Network for IM

In this approach, we use a CNN to learn the binary classification model from the given training data. The CNN takes a KG fact as input in the form of a triple (entity, object, relation) where the entity, object and relation/predicate are represented using their word embeddings in the order mentioned above. For entities and objects, instead of words, we use their word2vec DBpedia entity embeddings in 300 dimensions generated using an open source tool2. The tool is a slightly modified version of word2vec where, instead of treating the entity/object as separate words, they are treated as a single unit while generating word2vec embeddings. For example, the entire phrase "The Lord of the Rings" is treated as a single entity and will have a single embedding. For the relation/predicate, we use word2vec embeddings in 300 dimensions. We fix the length of the predicate to 8 words since most predicates are short and have length less than 8 words. If the predicate length is less than 8, we pad the remaining word vectors with zeros. If the length is greater than 8, we ignore the remaining words. The input matrix ((8 + 2) × 300) is then convolved through two rounds of convolution, pooling and non-linearity layers to get a 300-dimensional feature representation of the input triple. We use max-pooling for the pooling layer and the Rectified Linear Unit (ReLU) (Nair and Hinton 2010) as the non-linearity layer. The 300-dimensional vector representation passes through the second-stage Neural Network (NN) consisting of fully connected layers. These layers model the various interactions between the learnt features and finally output a probabilistic score indicative of the interestingness of the given fact. Although the system was described in steps, it is trained as a single end-to-end differentiable system.

3.2 Hand Crafted Features (HCF)

Given a fact triple (entity, relation/predicate, object), we compute various features which help in capturing the unusualness of an entity fact.

• Predicate Features:

Inverse Entity Frequency (IEF): In Information Retrieval (IR), Inverse Document Frequency (IDF) (Sparck Jones 1972) is usually used for measuring the importance of a term. We apply a similar intuition to our entity relationships. For a given relation/predicate p, we find how rare the predicate is with respect to the domain D. We believe that entities belonging to the same domain will have many predicates in common and hence rarer predicates would be good candidates for interestingness. We define the KG equivalent of IDF for a given predicate p as follows:

IEF(p) = log(|D| / n_p)

where |D| denotes the total number of entities in domain set D, and n_p denotes the number of entities in domain set D in which the predicate p occurs. For example, in the Bollywood Actors domain, a predicate like starring is present in almost all the entities and hence will have a low IEF value, whereas a predicate like author will have a high IEF since not many actors in Bollywood have authored a book.

Predicate Frequency-Inverse Entity Frequency (PF-IEF): Another well known term weighting technique in IR is TF-IDF (Salton and Buckley 1988). We define its KG equivalent, for an entity e and predicate p, as:

PF-IEF(p, e) = PF(p, e) × IEF(p)

where PF(p, e) is the number of RDF triples for entity e in which predicate p occurs.

Predicate Bag of Words: This feature converts the text of a predicate into a bag of words and then makes features out of them. Some predicates are potentially more interesting than the rest, for example owner, founder, leader.

2 https://github.com/idio/wiki2vec
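The two-round convolution pipeline described above can be sketched as a toy numpy forward pass. This is an illustrative sketch, not the authors' implementation: the filter counts, weight values and embeddings below are made-up placeholders; only the input layout ((8 + 2) × 300), the two convolution/max-pooling/ReLU rounds and the fully connected scoring stage follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, PRED_LEN, N_F1, N_F2 = 300, 8, 16, 300   # N_F1 is an illustrative choice

def conv_pool_relu(x, filters, region=3, pool=2):
    """One round of convolution (stride 1), max pooling and ReLU."""
    n = x.shape[0] - region + 1
    conv = np.stack([(x[i:i + region] * filters).sum(axis=(1, 2))
                     for i in range(n)])                    # (n, n_filters)
    pooled = np.stack([conv[i:i + pool].max(axis=0)
                       for i in range(0, n - pool + 1, pool)])
    return np.maximum(pooled, 0.0)                          # ReLU

def interestingness(entity_vec, object_vec, pred_vecs, params):
    # Input matrix: 1 entity row + 1 object row + 8 predicate rows = (10, 300)
    x = np.vstack([entity_vec[None, :], object_vec[None, :], pred_vecs])
    h = conv_pool_relu(x, params["f1"])        # first round
    h = conv_pool_relu(h, params["f2"])        # second round -> (1, 300)
    v = h.reshape(-1)                          # 300-d fact representation
    v = np.maximum(params["W1"] @ v, 0.0)      # fully connected hidden layer
    z = float(params["W2"] @ v)
    return 1.0 / (1.0 + np.exp(-z))            # P(fact is interesting)

# Toy run with random embeddings and untrained placeholder weights.
params = {
    "f1": rng.standard_normal((N_F1, 3, DIM)) * 0.01,
    "f2": rng.standard_normal((N_F2, 3, N_F1)) * 0.01,
    "W1": rng.standard_normal((64, N_F2)) * 0.01,
    "W2": rng.standard_normal(64) * 0.01,
}
pred = np.vstack([rng.standard_normal((2, DIM)),      # 2-word predicate...
                  np.zeros((PRED_LEN - 2, DIM))])     # ...zero-padded to 8
p = interestingness(rng.standard_normal(DIM), rng.standard_normal(DIM),
                    pred, params)
```

With untrained weights the score hovers near 0.5; training (Section 3.4) moves it toward the true binary label.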
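The IEF and PF-IEF features above can be illustrated on a toy domain. The triples below are invented for illustration; only the formulas follow the definitions in the text.

```python
import math
from collections import Counter

# Hypothetical toy domain D with three Bollywood-actor entities.
triples = [
    ("ActorA", "starring", "Film1"), ("ActorA", "starring", "Film2"),
    ("ActorA", "author",   "Book1"),
    ("ActorB", "starring", "Film3"),
    ("ActorC", "starring", "Film4"), ("ActorC", "starring", "Film5"),
]

entities = {e for e, _, _ in triples}
# n_p: number of entities whose facts use predicate p (count each pair once)
n_p = Counter(p for _, p in {(e, p) for e, p, _ in triples})

def ief(p):
    """Inverse Entity Frequency: IEF(p) = log(|D| / n_p)."""
    return math.log(len(entities) / n_p[p])

def pf(p, e):
    """Predicate Frequency: number of RDF triples of entity e using p."""
    return sum(1 for e2, p2, _ in triples if (e2, p2) == (e, p))

def pf_ief(p, e):
    return pf(p, e) * ief(p)
```

Here starring occurs for all three entities, so IEF("starring") = log(3/3) = 0, while the rare author predicate gets IEF = log(3), matching the intuition that rarer predicates are better trivia candidates.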
• Semantic Category Features:

Object Type: We find the semantic type of the relation object. Given an RDF triple <e,p,o>, we find the semantic type of the object o, such as Person, Place, Organization, Work, EducationalInstitution, Film, etc., from DBpedia. This could be a useful feature in capturing unusual associations in some predicates. For example, consider the triple <Jerry Garcia, named after, 4442 Garcia>, which conveys that the planet was named after him. We find the type information for o from DBpedia as Planet, which is interesting as the type Planet is a very rare occurrence in the Music domain.

Object Categories: For an RDF triple <e,p,o>, we get the various semantic tags associated with the object o. Such tags are usually defined using the "subject" predicate. For example, consider the triple <Farhan Akhtar, owner, Mumbai Tennis Masters>. The entity Mumbai Tennis Masters falls into categories like Tennis Teams and Sports Teams, which is interesting as most of the entities in the Bollywood domain with the predicate owner have categories like Companies based in Mumbai, Film production companies, etc.

Average Category and Type Entropy: In a given domain, most of the objects are associated with some common categories and types, which are semantic descriptive tags. For example, the most common object types in the Music domain are MusicalWork, MusicGroup, Album, etc. Any unusual or rare object category or type may turn a fact into a trivia. For example, for the triple <Celine Dion, key person, Cystic Fibrosis Canada>, the types of the object are Organization, Non-ProfitOrganisation, CharitiesBasedInCanada, FinancialInstitution and SocialGroup. Since some of these types are rarely associated with the Music domain, it is a potential trivia.

We estimate the distribution of category and type tags for each domain and measure the extent of surprise of an object category tag or type set C of a given triple using the "Average Category/Type Entropy", which is defined as follows:

E(C|M) = −(1/|C|) Σ_{c∈C} log₂ Pr(c|M)

where Pr(c|M) is the probability of the category/type tag c across the entire domain M. E is high if there are many rare categories/types associated with the object.

• Popularity: Trivia of an entity with higher popularity may generate greater surprise as opposed to a trivia about a relatively obscure entity. The number of in-links to a web page is usually indicative of its popularity. For example, consider the triple <Michael Jackson, owner, Bubbles(chimpanzee)>. Since Michael Jackson is popular, people may find unusual facts related to him interesting. Hence, for each RDF triple <e,p,o>, we estimate the relative popularity of the entity e and object o using the number of in-links to their Wikipedia pages. We calculate the number of in-links to the Wikipedia page using the MediaWiki Awk API3.

Domain      #Entities   #Triples   #Interesting Triples
Bollywood   100         11322      464
Music       100         30514      392

Table 1: Dataset Statistics

3.3 Fusion Based CNN for IM

Figure 2 shows the architecture of the F-CNN for IM. As described in Suggu et al. (2016a; 2016b), an F-CNN combines both HCF and automatic features learnt using the CNN in a single model. The F-CNN is similar to a CNN in the first stage. It takes the KG fact as input in the same representation format as prescribed for the CNN. Later, the input fact passes through two rounds of convolution, max-pooling and non-linearity layers to produce a 300-dimensional feature representation (let us call this CNN-FR). In parallel, using the input fact, the F-CNN also computes the hand-crafted features described earlier (let us call this HC-FR). In the second stage, these two feature representations are combined together (CNN-FR + HC-FR) and fed as input to the second-stage NN, which consists of fully connected layers in which the final output node gives the probability of the target class. To avoid unnecessarily processing a large number of non-informative HCF features through the second-stage NN, we perform basic feature selection and then only pick the top 1000 most informative features amongst them. For basic feature selection, we choose an embedded model with regularization approach (Tang, Alelyani, and Liu 2014; Ma and Huang 2008) in which we train a Linear SVM (SVM-L) with L1 regularization and choose the top 1000 features sorted as per their feature importance. For each domain, the SVM-L training for feature selection was performed using a random sample of 5000 facts sampled from the dataset of the corresponding domain.

3.4 Training

We train the parameters of the CNN and F-CNN with an objective to maximize their prediction accuracy given the target classes. We perform 10-fold cross validation on the evaluation dataset. Each random fold was divided into training, validation and test sets. The training set consisted of the entity triples along with their true binary label. We train the CNN and F-CNN using the training set and tune the hyper-parameters of the network using the validation set. Since the

3 https://github.com/greencardamom/MediaWikiAwkAPI
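The Average Category/Type Entropy feature can be sketched as follows. The tag sets below are invented, and Pr(c|M) is estimated here as the share of domain objects carrying a tag, which is one plausible reading of the definition; tags in C are assumed to have been observed in the domain (a real implementation would smooth unseen tags).

```python
import math
from collections import Counter

# Hypothetical category/type tag sets of objects in a toy Music domain M.
domain_tag_sets = [
    {"MusicalWork"}, {"MusicalWork", "Album"}, {"MusicGroup"},
    {"MusicalWork"}, {"Organization", "SocialGroup"},
]
tag_counts = Counter(t for tags in domain_tag_sets for t in tags)
n_objects = len(domain_tag_sets)

def avg_entropy(C):
    """E(C|M) = -(1/|C|) * sum_{c in C} log2 Pr(c|M).

    Pr(c|M) is estimated as the fraction of objects in the domain that
    carry tag c; higher E means the object's tags are rarer."""
    return -sum(math.log2(tag_counts[c] / n_objects) for c in C) / len(C)
```

A common tag like MusicalWork yields a low score, while a rare tag like SocialGroup yields a high one, flagging the fact as a potential trivia.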
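The second-stage fusion with top-1000 feature selection described above can be sketched as below. The feature values and SVM-L weights here are random stand-ins; in the paper the weights come from an L1-regularized Linear SVM trained on 5000 sampled facts, and the concatenated vector feeds the fully connected second-stage NN.

```python
import numpy as np

rng = np.random.default_rng(1)
hcf = rng.random(5000)              # stand-in hand-crafted feature vector
svm_w = rng.standard_normal(5000)   # stand-in for L1-regularized SVM-L weights

K = 1000
top = np.argsort(np.abs(svm_w))[-K:]     # top-K most informative features
hc_fr = hcf[top]                         # HC-FR after feature selection

cnn_fr = rng.random(300)                 # stand-in for the learnt CNN-FR
fused = np.concatenate([cnn_fr, hc_fr])  # (CNN-FR + HC-FR) -> second-stage NN
```

Selecting by absolute weight magnitude is the usual way to rank features of a linear model; with L1 regularization most weights are exactly zero, so the top 1000 are the surviving informative features.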
number of hyper-parameters and their combinations is high for the CNN and F-CNN, we tuned their hyper-parameters using the validation set of the first fold alone and then used the same combination across the remaining folds. The optimal hyper-parameter configuration thus found is shown in Table 2. If t is the true label and o is the output of the network with the current weight configuration, we use Binary Cross Entropy (BCE) as the loss function, which is calculated as follows:

BCE(t, o) = −(t · log(o) + (1 − t) · log(1 − o))

We use Stochastic Gradient Descent (SGD) as the optimization routine, and the models were trained by minimizing the above loss function in a batch size of n, which was tuned differently for the CNN and F-CNN and is given in Table 2.

Hyperparameter             CNN    F-CNN
Batch size                 128    256
Convolution Layers         2      2
Convolution Region Size    3      3
Max Pooling Units          2      2
Stride                     1      1
Non-Linearity              ReLU   ReLU
Optimizer                  SGD    SGD
Momentum                   0.6    0.8
Dropout Rate               0.2    0.1
Fully Connected Layers     2      4

Table 2: Tuned Hyper-parameters which were used for training the CNN and F-CNN Models

Model     Bollywood Actors          Music Artists
          Prec.   Rec.   F1         Prec.   Rec.   F1
F-CNN     0.83*   0.79   0.81*      0.72*   0.59   0.65*
CNN       0.79    0.72   0.75*      0.69    0.55   0.59*
GBC       0.77    0.67   0.71       0.69    0.41   0.48
SVC-RBF   0.71    0.71   0.71       0.53    0.58   0.54
SVC-L     0.66    0.81   0.72       0.5     0.68   0.55

Table 3: 10-Fold Cross Validation Results: DTM variants (CNN and F-CNN) in comparison with other baseline approaches. Results marked with a * were found to be statistically significant with respect to the nearest baseline at the 95% confidence level (α = 0.05) when tested using a two-tailed paired t-test. All the models were trained sensitive to class imbalance.

Rank   Bollywood Actors           Music Artists
       GBC           SVC-L        GBC           SVC-L
1      IEF           IEF          IEF           IEF
2      Ent.Inlinks   distributor  Ent.Inlinks   business
3      Cat.Entropy   writer       players       birthplace
4      PF-IEF        PF-IEF       activism      recipients
5      lead          people       official      multi
6      writer        bollywood    PF-IEF        lebanese
7      Obj.Inlinks   culture      drowning      activism
8      beauty        men          armenian      australia
9      TypeEntropy   playback     canadian      drowning
10     ancient       child        multi         musician

Table 4: Top 10 Informative Features for the SVM-L and GBC Baselines.

4 Experimental Setup

In this section, we describe our experimental setup, which includes details related to our dataset, evaluation metrics and baseline approaches.

4.1 Dataset

For our experiments, we chose two popular domains within DBpedia: a) Bollywood Actors (Indian Movie Actors) and b) Music Artists. The above domains were chosen since we believe an average human annotator available to us would have sufficient knowledge to judge the interestingness of the facts in these domains. For each of the above domains, we extracted all the entities and their RDF relationships from DBpedia: 1113 entities and 58633 RDF relationships (Bollywood Actors) and 9232 entities and 665631 RDF relationships (Music Artists) respectively. Since an exhaustive manual evaluation of all the above relations across entities was not practically feasible, we chose a smaller subset of entities for our evaluation. For the Bollywood Actors domain, we chose the most popular 100 actors from the above list who were either a) listed in the Forbes top 100 celebrity list in the past four years4 or b) listed in the IMDB top 100 Bollywood actors of all time5. For Music Artists, we selected the top 100 artists from the Billboard6 list for the past five years.

To get the gold standard label for the interestingness of a fact, we employed five independent human annotators who were exclusively used for this task. The annotators were given clear guidelines and training on how to judge the interestingness of facts, along with some sample judgments from each domain. For each domain, from the selected 100 entities, we extracted all their associated facts (triples) and asked the judges to rate them for interestingness on a binary scale: Interesting or Boring. The majority judgment from the judges was then taken as the true interestingness label for each fact. The inter-annotator agreement using the Kappa statistic (Fleiss 1971) was found to be 0.52 and 0.48 for the Bollywood and Music domains respectively, which is a moderate agreement.

The final statistics of the dataset along with the annotated judgments for each domain are given in Table 1. In the interest of reproducibility of our results and promoting research in this area, we are making our entire dataset public for research purposes. It can be downloaded from the

4 http://forbesindia.com/lists/2015-celebrity-100/1519/1
5 http://www.imdb.com/list/ls074758327/
6 http://www.billboard.com/artists/top-100
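The BCE loss and an SGD-with-momentum update can be sketched as below. The scalar forms are for illustration (the real models operate on batches of weight matrices); the momentum values match those tuned in Table 2, while the learning rate is an arbitrary placeholder.

```python
import math

def bce(t, o, eps=1e-12):
    """BCE(t, o) = -(t*log(o) + (1-t)*log(1-o)); eps guards against log(0)."""
    o = min(max(o, eps), 1 - eps)
    return -(t * math.log(o) + (1 - t) * math.log(1 - o))

def sgd_momentum_step(w, velocity, grad, lr=0.1, mu=0.6):
    """One SGD update with momentum mu (0.6 for the CNN, 0.8 for the F-CNN)."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity
```

For a true label t = 1, an output of 0.5 costs log 2 ≈ 0.693, and the loss shrinks toward 0 as the output approaches 1, which is what minimizing BCE over the training set rewards.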
DTM Variant | Entity | Trivia | Description (from Wikipedia Page)
F-CNN | Marvin Gaye (Music Artist) | <Marvin Gaye, subject, United States Air Force airmen> | A 17-year-old Marvin ran away from home and joined the US Air Force as a Basic Airman.
F-CNN | Sanjay Dutt (Bollywood Actor) | <Sanjay Dutt, founder, Super Fight League> | The Super Fight League is an India based mixed martial arts promotion which is also the biggest in Asia.
F-CNN | Akon (Music Artist) | <Akon, owner, Akon Lighting Africa> | Akon Lighting Africa was started in 2014 and aims to provide electricity by solar energy in Africa.
CNN | Farhan Akhtar (Bollywood Actor) | <Farhan Akhtar, owner, Mumbai Tennis Masters> | Farhan Akhtar is the owner of the Mumbai Tennis Masters club.
CNN | Tupac Shakur (Music Artist) | <Tupac Shakur, subject, Murdered Rappers> | On September 7, 1996, Shakur was fatally shot in a drive-by shooting in Las Vegas, Nevada.

Table 5: Representative Samples of Interesting Trivia mined using two variants of DTM.
volutional Neural Network (CNN) based approach and b) Fusion Based CNN which combined both hand-crafted features and automatically learnt CNN features. Experiments across two different domains, Bollywood Actors and Music Artists, revealed that the CNN based approach automatically learns features which are relevant to the task and achieves competitive performance relative to the hand-crafted feature based baselines. Fusion Based CNN (F-CNN) significantly improves the performance over baseline approaches which use hand-crafted features alone. As part of future work, we would like to extend our work to include user feedback signals regarding interestingness, so that the notion of interestingness could be gradually personalized to a given user.

References

Akoglu, L.; Tong, H.; and Koutra, D. 2015. Graph Based Anomaly Detection and Description: A Survey. Data Mining and Knowledge Discovery 29(3):626–688.

Attfield, S.; Kazai, G.; Lalmas, M.; and Piwowarski, B. 2011. Towards a Science of User Engagement. In WSDM Workshop on User Modelling for Web Applications.

Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; and Ives, Z. 2007. DBpedia: A Nucleus for a Web of Open Data. In The Semantic Web. Springer. 722–735.

Berners-Lee, T.; Hendler, J.; Lassila, O.; et al. 2001. The Semantic Web. Scientific American 284(5):28–37.

Bizer, C.; Heath, T.; and Berners-Lee, T. 2009. Linked Data - The Story So Far. Semantic Services, Interoperability and Web Applications: Emerging Concepts 205–227.

Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. P. 2011. Natural Language Processing (almost) from Scratch. CoRR abs/1103.0398.

Cortes, C., and Vapnik, V. 1995. Support Vector Networks. Machine Learning 20(3):273–297.

Dahl, G. E. 2015. Deep Learning Approaches to Problems in Speech Recognition, Computational Chemistry, and Natural Language Text Processing. Ph.D. Dissertation, University of Toronto.

Fleiss, J. L. 1971. Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin 76(5):378.

Friedman, J. H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 1189–1232.

Geng, L., and Hamilton, H. J. 2006. Interestingness Measures for Data Mining: A Survey. ACM Computing Surveys (CSUR) 38(3):9.

Kontonasios, K.-N.; Spyropoulou, E.; and De Bie, T. 2012. Knowledge Discovery Interestingness Measures Based On Unexpectedness. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(5):386–399.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, 1097–1105.

Ma, S., and Huang, J. 2008. Penalized Feature Selection and Classification in Bioinformatics. Briefings in Bioinformatics 9(5):392–403.

Mahesh, K., and Karanth, P. 2015. Smart-aleck: An Interestingness Algorithm for Large Semantic Datasets. MOEKA - Memoranda in Ontological Engineering and Knowledge Analytics. Recent Papers on Ontological Engineering 103.

McGarry, K. 2005. A Survey of Interestingness Measures for Knowledge Discovery. The Knowledge Engineering Review 20(1):39–61.

Michael Gamon, Arjun Mukherjee, P. P. 2014. Predicting Interesting Things in Text. In COLING '14. ACL.

Nair, V., and Hinton, G. E. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807–814.

O'Brien, H. L., and Toms, E. G. 2010. The Development and Evaluation of a Survey to Measure User Engagement. Journal of the American Society for Information Science and Technology 61(1):50–69.

Perozzi, B.; Akoglu, L.; Iglesias Sánchez, P.; and Müller, E. 2014. Focused Clustering and Outlier Detection in Large Attributed Graphs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1346–1355. ACM.

Prakash, A.; Chinnakotla, M. K.; Patel, D.; and Garg, P. 2015. Did You Know? Mining Interesting Trivia for Entities from Wikipedia. In Proceedings of the 24th International Conference on Artificial Intelligence, 3164–3170. AAAI Press.

Raghavan, V.; Bollmann, P.; and Jung, G. S. 1989. A Critical Investigation of Recall and Precision as Measures of Retrieval System Performance. ACM Transactions on Information Systems (TOIS) 7(3):205–229.

Salton, G., and Buckley, C. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management 24(5):513–523.

Sparck Jones, K. 1972. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation 28(1):11–21.

Suchanek, F. M.; Kasneci, G.; and Weikum, G. 2007. Yago: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web, 697–706. ACM.

Suggu, S. P.; Goutham, K. N.; Chinnakotla, M. K.; and Shrivastava, M. 2016a. Deep Feature Fusion Network for Answer Quality Prediction in Community Question Answering. In SIGIR '16. ACM.

Suggu, S. P.; Goutham, K. N.; Chinnakotla, M. K.; and Shrivastava, M. 2016b. Hand In Glove: Deep Feature Fusion Network Architectures for Answer Quality Prediction in Community Question Answering. In COLING '16. ACL.

Tang, J.; Alelyani, S.; and Liu, H. 2014. Feature Selection for Classification: A Review. Data Classification: Algorithms and Applications 37.

Varela, F. J.; Thompson, E.; and Rosch, E. 1993. The Embodied Mind: Cognitive Science and Human Experience. Cambridge/MA: The MIT Press.

Vrandečić, D., and Krötzsch, M. 2014. Wikidata: A Free Collaborative Knowledge Base. Communications of the ACM 57(10):78–85.

Wizard, P. M. 2013. Trivia Game on Facebook.