

Bringing Light Into the Dark:


A Large-scale Evaluation of Knowledge Graph
Embedding Models under a Unified Framework
Mehdi Ali, Max Berrendorf† , Charles Tapley Hoyt† , Laurent Vermue† , Mikhail Galkin,
Sahand Sharifzadeh, Asja Fischer, Volker Tresp, and Jens Lehmann

Abstract—The heterogeneity in recently published knowledge graph embedding models' implementations, training, and evaluation has made fair and thorough comparisons difficult. To assess the reproducibility of previously published results, we re-implemented and evaluated 19 models in the PyKEEN software package. In this paper, we outline which results could be reproduced with their reported hyper-parameters, which could only be reproduced with alternate hyper-parameters, and which could not be reproduced at all, as well as provide insight as to why this might be the case.
We then performed a large-scale benchmarking on four datasets with several thousands of experiments and 21,246 GPU hours of computation time. We present insights gained as to best practices, best configurations for each model, and where improvements could be made over previously published best configurations. Our results highlight that the combination of model architecture, training approach, loss function, and the explicit modeling of inverse relations is crucial for a model's performance and is not only determined by its architecture. We provide evidence that several architectures can obtain results competitive to the state of the art when configured carefully. We have made all code, experimental configurations, results, and analyses available at https://github.com/pykeen/pykeen and https://github.com/pykeen/benchmarking.

Index Terms—Knowledge Graph Embeddings, Link Prediction, Reproducibility, Benchmarking

†Equal contribution.
Mehdi Ali is affiliated with Smart Data Analytics (University of Bonn), Germany, & Fraunhofer IAIS, Sankt Augustin and Dresden, Germany.
Max Berrendorf is affiliated with Ludwig-Maximilians-Universität München, Munich, Germany.
Charles Tapley Hoyt is affiliated with Enveda Biosciences, Bonn, Germany.
Laurent Vermue is affiliated with the Technical University of Denmark, Kongens Lyngby, Denmark.
Mikhail Galkin is affiliated with the Technical University of Dresden, Germany, & Fraunhofer IAIS, Sankt Augustin and Dresden, Germany.
Sahand Sharifzadeh is affiliated with Ludwig-Maximilians-Universität München, Munich, Germany.
Asja Fischer is affiliated with the Ruhr University Bochum, Germany.
Volker Tresp is affiliated with Ludwig-Maximilians-Universität München & Siemens AG, Munich, Germany.
Jens Lehmann is affiliated with Smart Data Analytics (University of Bonn), Bonn, Germany, & Fraunhofer IAIS, Sankt Augustin and Dresden, Germany.

1 INTRODUCTION

As the usage of knowledge graphs (KGs) becomes more widespread, their inherent incompleteness can pose a liability for typical downstream tasks that they support, e.g., question answering, dialogue systems, and recommendation systems [1]. Knowledge graph embedding models (KGEMs) present an avenue for predicting missing links. However, the following two major challenges remain in their application.
First, the reproduction of previously reported results turned out to be a major challenge — there are even examples of different results reported for the same combinations of KGEMs and datasets [2]. In some cases, the lack of availability of source code for KGEMs or the usage of different frameworks and programming languages inevitably introduces variability. In other cases, the lack of a precise specification of hyper-parameters introduces variability.
Second, the verification of the novelty of previously reported results remains difficult. It is often difficult to attribute the incremental improvements in performance reported with each new state of the art model to the model's architecture itself or instead to the training approach, hyper-parameter values, or specific preprocessing steps, e.g., the explicit modeling of inverse relations. It has been shown that baseline models can achieve competitive performance to more sophisticated ones when optimized appropriately [3], [2]. Additionally, the variety of implementations and interpretations of common evaluation metrics for link prediction makes a fair comparison to previous results difficult [4].
This paper makes two major contributions towards addressing these challenges:
1) We performed a reproducibility study in which we tried to replicate reported experimental results in the original papers (when sufficient information was provided).
2) We performed an extensive benchmark study on 19 KGEMs over four benchmark datasets in which we evaluated the models based on different hyper-parameter values, training approaches (i.e. training under the local closed world assumption and stochastic local closed world assumption), loss functions, optimizers, and the explicit modeling of inverse relations.
Previous studies have already investigated important aspects for a subset of models: Kadlec et al. [3] showed that a fine-tuned baseline (DistMult [5]) can outperform

more sophisticated models on FB15K. Akrami et al. [2], [6] examined the effect of removing faulty triples from KGs on the model's performance. Mohamed et al. [7] studied the influence of loss functions on the models' performances for a set of KGEMs. Concurrent to the work on this paper, Ruffinelli et al. [8] performed a benchmarking study in which they investigated five knowledge graph embedding models. After describing their benchmarking [8], they called for a larger study that extends the search space and incorporates more sophisticated models. Our study answers this call and realizes a fair benchmarking by completely re-implementing KGEMs, training pipelines, loss functions, regularizations, and evaluation metrics in a unified, open-source framework. Inspired by their findings, we have also included the cross entropy loss (CEL) function, which has been previously used by Kadlec et al. [3]. Our benchmarking can be considered as a superset of many previous benchmarkings — to the best of our knowledge, there exists no study of comparable breadth or depth. A further interesting study with a different focus is the work of Rossi et al. [9] in which they investigated the effect of the structural properties of KGs on models' performances, instead of focusing on the combinations of different model architectures, training approaches, and loss functions.
This article is structured as follows: in Section 2, we introduce our notation of KGs and the link prediction task and introduce an exemplary KG to which we refer in examples throughout this paper. In Section 3, we present our definition of a KGEM and review the KGEMs that we investigated in our studies. In Section 4, we describe and discuss established evaluation metrics as well as a recently proposed one [10]. In Section 5, we introduce the benchmark datasets on which we conducted our experiments. In Section 6 and Section 7, we present our respective reproducibility and benchmarking studies. Finally, we provide a discussion and an outlook for our future work in Section 8.

Fig. 1. Exemplary KG: nodes represent entities and edges their respective relations.

2 KNOWLEDGE GRAPHS

For a given set of entities E and set of relations R, we consider a knowledge graph K ⊆ K = E × R × E as a directed, multi-relational graph that comprises triples (h, r, t) ∈ K in which h, t ∈ E represent a triple's respective head and tail entities and r ∈ R represents its relationship. Figure 1 depicts an exemplary KG. The direction of a relationship indicates the roles of the entities, i.e., head or tail entity. For instance, in the triple (Sarah, CEO Of, Deutsche Bank), Sarah is the head and Deutsche Bank is the tail entity. KGs usually contain only true triples corresponding to available knowledge.
In contrast to triples in a KG, there are different philosophies, or assumptions, for the consideration of triples not contained in a KG [11], [12]. Under the closed world assumption (CWA), all triples that are not part of a KG are considered as false. Based on the example in Figure 1, the triple (Sarah, lives in, Germany) is a false fact under the CWA since it is not part of the KG. Under the open world assumption (OWA), it is considered unknown as to whether triples that are not part of the KG are true or false. The construction of KGs under the principles of the semantic web (and RDF) relies on the OWA, as do most of the works relevant to this paper [13], [11].
Because KGs are usually incomplete and noisy, several approaches have been developed to predict new links. In particular, the task of link prediction is defined as predicting the tail/head entities for (h, r)/(r, t) pairs. For instance, given queries of the form (Sarah, studied at, ?) or (?, CEO of, Deutsche Bank), the task is to correctly detect the entities that answer the query, i.e. (Sarah, studied at, University of Oxford) and (Sarah, CEO of, Deutsche Bank). While classical approaches have relied on domain-specific rules to derive missing links, they usually require a large number of user-defined rules in order to generalize [11]. Alternatively, machine learning approaches learn to predict new links based on the set of existing ones. It has been shown that especially relational machine learning methods are successful in predicting missing links and identifying incorrect ones, and recently knowledge graph embedding models have gained significant attention [11].

3 KNOWLEDGE GRAPH EMBEDDING MODELS

Knowledge graph embedding models (KGEMs) learn latent vector representations of the entities e ∈ E and relations r ∈ R in a KG that best preserve its structural properties [1], [11]. Besides link prediction, they have been used for tasks such as entity disambiguation and clustering as well as for downstream tasks such as question answering, recommendation systems, and relation extraction [1]. Figure 2 shows an embedding of the entities and relations in R2 from the KG from Figure 1.
Here, we define a KGEM as four components: an interaction model, a training approach, a loss function, and its usage of explicit inverse relations. This abstraction enables investigation of the effect of each component individually and in combination on each KGEM's performance. Each is described in detail in its respective subsection below (3.1, 3.2, 3.3, and 3.4).
In this paper, we use a boldface lower-case letter x to denote a vector, ‖x‖p to represent its lp norm, a boldface upper-case letter X to denote a matrix, and a fraktur-font upper-case letter X to represent a three-mode tensor. Furthermore, we use ⊙ to denote the Hadamard product ⊙ : Rd × Rd → Rd:

[a ⊙ b]i = ai · bi   (1)
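To make the ranking view of link prediction concrete before the individual interaction models are introduced, the following minimal NumPy sketch (not taken from the paper or from PyKEEN) scores all candidate tails for a query (h, r, ?) with a TransE-style placeholder scorer and sorts them by plausibility; the entities, relations, and random embeddings are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

entities = ["Sarah", "Deutsche_Bank", "University_of_Oxford", "Germany"]
relations = ["CEO_of", "studied_at", "lives_in"]

d = 8  # embedding dimension
E = {e: rng.normal(size=d) for e in entities}   # entity embeddings
R = {r: rng.normal(size=d) for r in relations}  # relation embeddings

def score(h, r, t):
    """Placeholder interaction model: negative L2 distance of h + r to t."""
    return -np.linalg.norm(E[h] + R[r] - E[t])

def rank_tails(h, r):
    """Score every candidate tail for the query (h, r, ?) and sort by plausibility."""
    scored = [(t, score(h, r, t)) for t in entities]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for t, s in rank_tails("Sarah", "studied_at"):
    print(f"(Sarah, studied_at, {t}): {s:.3f}")
```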

1-N, N-1, and N-M relations: assume (h, r, t1 ), (h, r, t2 ) ∈ K,


then the model adapts the embeddings in order to ensure
h + r ≈ t1 and h + r ≈ t2 which results in t1 ≈ t2 .
3.1.0.4 TransH: TransH [17] is an extension of
TransE that specifically addresses the limitations of TransE
in modeling 1-N, N-1, and N-M relations. In TransH, each
relation is represented by a hyperplane, or more specifically
a normal vector of this hyperplane wr ∈ Rd , and a vector
dr ∈ Rd that lies in the hyperplane. To compute the plausi-
bility of a triple (h, r, t) ∈ K, the head embedding h ∈ Rd
and the tail embedding t ∈ Rd are first projected onto the
Fig. 2. An example embedding of the entities and relations from the relation-specific hyperplane:
knowledge graph portrayed by Figure 1.
hr = h − wr⊤ h wr ,   (5)
tr = t − wr⊤ t wr .   (6)
Finally, we use x̄ to denote the conjugate of a complex number x ∈ C.
Then, the projected embeddings are used to compute the
3.1 Interaction Models score for the triple (h, r, t):
An interaction model f : E × R × E → R computes a f (h, r, t) = −khr + dr − tr k22 . (7)
real-valued score representing the plausibility of a triple
(h, r, t) ∈ K given the embeddings for the entities and 3.1.0.5 TransR: TransR [18] is an extension of
relations. In general, a larger score indicates a higher plau- TransH that explicitly considers entities and relations as
sibility. The interpretation of the score value is model- different objects and therefore represents them in different
dependent, and usually it cannot be directly interpreted as a vector spaces. For a triple (h, r, t) ∈ K, the entity embed-
probability. In this subsection, we review several previously dings, h, t ∈ Rd , are first projected into the relation space by
published interaction models. means of a relation-specific projection matrix Mr ∈ Rk×d :
3.1.0.1 Unstructured Model: The Unstructured
hr = Mr h , (8)
Model (UM) [14] scores a triple by computing the distance
between the head and tail entity
tr = Mr t . (9)
f (h, t) = −kh − tk22 , (2)
Finally, the score of the triple (h, r, t) is computed:
where h, t ∈ Rd are the embeddings of head and tail entity,
respectively. A small distance between these embeddings f (h, r, t) = −khr + r − tr k22 (10)
indicates a plausible triple. In the UM, relations are not k
considered, and therefore, it cannot distinguish between where r ∈ R .
different relationship types. However, the model can be 3.1.0.6 TransD: TransD [19] is an extension of
beneficial for learning embeddings for KGs that contain only TransR that, like TransR, considers entities and relations as
a single relationship type or only equivalent relationship objects living in different vector spaces. However, instead
types, e.g. GrandmotherOf and GrandmaOf. Moreover, it may of performing the same relation-specific projection for all
serve as a baseline to interpret the performance of relation- entity embeddings, entity-relation-specific projection matri-
aware models. ces Mr,h , Mt,h ∈ Rk×d are constructed. To do so, all head
3.1.0.2 Structured Embedding: Structured Embed- entities, tail entities, and relations are represented by two
ding (SE) [15] models each relation by two matrices vectors, h, hp , t, tp ∈ Rd and r, rp ∈ Rk , respectively. The
Mhr , Mtr ∈ Rd×d that perform relation-specific projections first set of embeddings is used for calculating the entity-
of the head and tail embeddings: relation-specific projection matrices:

f (h, r, t) = −kMhr h − Mtr tk1 . (3) Mr,h = rp hTp + Ĩ , (11)

As before, h, t ∈ Rd are the embeddings of head and tail


entity, respectively. By employing different projections for Mr,t = rp tTp + Ĩ , (12)
the embeddings of the head and tail entities, SE explicitly where Ĩ ∈ Rk×d is a k × d matrix with ones on the diag-
distinguishes between the subject- and object-role of an onal and zeros elsewhere. Next, h and t are projected into
entity. the relation space by means of the constructed projection
3.1.0.3 TransE: TransE [16] models relations as a matrices:
translation of head to tail embeddings, i.e. h + r ≈ t. Thus, hr = Mr,h h , (13)
the interaction model is defined as:
f (h, r, t) = −kh + r − tkp , (4) tr = Mr,t t . (14)
with p ∈ {1, 2} is a hyper-parameter. A major advantage of
Finally, the plausibility score for (h, r, t) ∈ K is given by:
TransE is its computational efficiency which enables its us-
age for large scale KGs. However, it inherently cannot model f (h, r, t) = −khr + r − tr k22 . (15)

3.1.0.7 RESCAL: RESCAL [20] is a bilinear model Therefore, for each triple (e1 , r, e2 ) ∈ K, both he1 and te2 as
that models entities as vectors and relations as matrices. well as he2 and te1 are updated [23].
The relation matrices Wr ∈ Rd×d contain weights wi,j that 3.1.0.12 TuckER: TuckER [24] is a linear model that
capture the amount of interaction between the i-th latent is based on the tensor factorization method Tucker [25] in
factor of h ∈ Rd and the j -th latent factor of t ∈ Rd [11], which a three-mode tensor X ∈ RI×J×K is decomposed
[20]. Thus, the plausibility score of (h, r, t) ∈ K is given by: into a set of factor matrices A ∈ RI×P , B ∈ RJ×Q , and
d C ∈ RK×R and a core tensor Z ∈ RP×Q×R (of lower rank):
f (h, r, t) = h^T W_r t = \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij}^{(r)} h_i t_j   (16)
X ≈ Z ×1 A ×2 B ×3 C ,   (22)

3.1.0.8 DistMult: DistMult [5] is a simplification where ×n is the tensor product, with n denoting along
of RESCAL where the relation matrices Wr ∈ Rd×d are which mode the tensor product is computed. In TuckER,
restricted to diagonal matrices: a KG is considered as a binary tensor which is factorized
using the Tucker factorization where E = A = C ∈ Rne ×de
f (h, r, t) = h^T W_r t = \sum_{i=1}^{d} h_i · diag(W_r)_i · t_i .   (17)
denotes the entity embedding matrix, R = B ∈ Rnr×dr represents the relation embedding matrix, and W = Z ∈
i=1
Rde ×dr ×de is the core tensor that indicates the extent of
Because of its restriction to diagonal matrices DistMult is interaction between the different factors. The interaction
computationally more efficient than RESCAL, but at the same
time less expressive. For instance, it is not able to model
anti-symmetric relations, since f (h, r, t) = f (t, r, h). f (h, r, t) = W ×1 h ×2 r ×3 t , (23)
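The relationship between RESCAL (Eq. 16) and DistMult (Eq. 17) can be seen directly in code. The following NumPy sketch (illustrative only, not the reference implementations) checks that DistMult coincides with RESCAL for a diagonal relation matrix and that its score is symmetric in head and tail, which is why it cannot model anti-symmetric relations.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
d = 16
h, t = rng.normal(size=d), rng.normal(size=d)

W_r = rng.normal(size=(d, d))        # full relation matrix (RESCAL)
w_r_diag = rng.normal(size=d)        # diagonal entries only (DistMult)

def rescal_score(h, W_r, t):
    """Eq. (16): bilinear form h^T W_r t with a full d x d relation matrix."""
    return h @ W_r @ t

def distmult_score(h, w_r_diag, t):
    """Eq. (17): RESCAL restricted to a diagonal relation matrix."""
    return np.sum(h * w_r_diag * t)

# DistMult equals RESCAL when the relation matrix is diagonal:
assert np.isclose(distmult_score(h, w_r_diag, t), rescal_score(h, np.diag(w_r_diag), t))
# ...and is symmetric in head and tail, so it cannot model anti-symmetry:
assert np.isclose(distmult_score(h, w_r_diag, t), distmult_score(t, w_r_diag, h))
```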
3.1.0.9 ComplEx: ComplEx [21] is an extension of
where h, t correspond to rows of E and r to a row of R.
DistMult that uses complex valued representations for the
3.1.0.13 ProjE: ProjE [26] is a neural network-based
entities and relations. Entities and relations are represented
approach with a combination and a projection layer. The
as vectors h, r, t ∈ Cd , and the plausibility score is computed
interaction model first combines h and r by following
using the Hadamard product:
combination operator [26]:
f (h, r, t) = Re(h ⊙ r ⊙ t̄)
h ⊗ r = De h + Dr r + bc ,   (24)
= ⟨Re(h), Re(r), Re(t)⟩
+ ⟨Im(h), Re(r), Im(t)⟩   (18) where De , Dr ∈ Rk×k are diagonal matrices which are used
as shared parameters among all entities and relations, and
+ ⟨Re(h), Im(r), Im(t)⟩
bc ∈ Rk represents the candidate bias vector shared across
− ⟨Im(h), Im(r), Re(t)⟩ , all entities. Next, the score for the triple (h, r, t) ∈ K is
where Re(x) and Im(x) denote the real and imaginary parts computed:
of the complex valued vector x. Because the Hadamard
f (h, r, t) = g(t z(h ⊗ r) + bp ) , (25)
product is not commutative in the complex space, ComplEx
can model anti-symmetric relations in contrast to DistMult. where g and z are activation functions, and bp represents
3.1.0.10 RotatE: RotatE [22] models relations as ro- the shared projection bias vector.
tations from head to tail entities in the complex space: 3.1.0.14 HolE: Holographic embeddings (HolE) [27]
make use of the circular correlation operator to compute
t=h r , (19)
interactions between latent features of entities and relations:
d
where h, r, t ∈ C and |ri | = 1, that is the complex elements
of r are restricted to have a modulus of one. Because of the f (h, r, t) = σ(rT (h ? t)) . (26)
latter, ri can be represented as eiθr,i , which corresponds to
a counterclockwise rotation by θr,i radians. The interaction where the circular correlation ? : Rd × Rd → Rd is defined
model is then defined as: as:
d−1
X
f (h, r, t) = −kh r − tk , (20) [a ? b]i = ak ∗ b(i+k) mod d (27)
k=0
which allows to model symmetry, antisymmetry, inversion,
and composition [22]. By using the correlation operator each component [h ? t]i
3.1.0.11 SimplE: SimplE [23] is an extension of represents a sum over a fixed partition over pairwise inter-
canonical polyadic (CP) [23], one of the early tensor factor- actions. This enables the model to put semantic similar in-
ization approaches. In CP, each entity e ∈ E is represented teractions into the same partition and share weights through
by two vectors he , te ∈ Rd and each relation by a single r. Similarly irrelevant interactions of features could also be
vector r ∈ Rd . Depending on whether an entity participates in r. Similarly, irrelevant interactions of features could also be
a triple as the head or tail entity, either he or te is used. small weight in r.
Both entity representations are learned independently, i.e. 3.1.0.15 KG2E: KG2E [28] aims to explicitly model
observing a triple (e1 , r, e2 ), the method only updates he1 (un)certainties in entities and relations (e.g. influenced by
and te2 . In contrast to CP, SimplE introduces for each rela- the number of triples observed for these entities and re-
tion r the inverse relation r0 , and formulates the interaction lations). Therefore, entities and relations are represented
model based on both: by probability distributions, in particular by multi-variate
f (h, r, t) = \frac{1}{2} ( ⟨h_{e_i}, r, t_{e_j}⟩ + ⟨h_{e_j}, r′, t_{e_i}⟩ ) .   (21)
Gaussian distributions N_i(µi , Σi) where the mean µi ∈ Rd denotes the position in the vector space and the diagonal
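ComplEx (Eq. 18) and RotatE (Eq. 20) both operate on complex-valued embeddings. The sketch below (again a toy NumPy implementation, not the reference code) computes both scores and shows that ComplEx, unlike DistMult, is in general not symmetric in head and tail.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
d = 16

def random_complex(size):
    return rng.normal(size=size) + 1j * rng.normal(size=size)

h, t = random_complex(d), random_complex(d)

def complex_score(h, r, t):
    """Eq. (18): Re(<h, r, conj(t)>) with complex-valued embeddings."""
    return np.real(np.sum(h * r * np.conj(t)))

def rotate_score(h, theta_r, t):
    """Eq. (20): relation as element-wise rotation e^{i*theta}; negative distance as score."""
    r = np.exp(1j * theta_r)          # modulus-one complex entries
    return -np.linalg.norm(h * r - t)

r = random_complex(d)
theta_r = rng.uniform(0, 2 * np.pi, size=d)
print(complex_score(h, r, t), rotate_score(h, theta_r, t))
# Unlike DistMult, ComplEx is generally not symmetric in h and t:
print(np.isclose(complex_score(h, r, t), complex_score(t, r, h)))
```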

variance Σi ∈ Rd×d models the uncertainty. Inspired by the TransE model, relations are modeled as transformations from head to tail entities: H − T ≈ R where H ∼ N_h(µh, Σh), T ∼ N_t(µt, Σt), R ∼ P_r = N_r(µr, Σr) and H − T ∼ P_e = N_{h−t}(µh − µt, Σh + Σt) (since head and tail entities are considered to be independent with regards to the relations). The interaction model measures the similarity between P_e and P_r by means of the Kullback-Leibler (KL) divergence:

f (h, r, t) = D_KL(P_e, P_r)
           = \frac{1}{2} \Big\{ tr(Σ_r^{-1} Σ_e) + (µ_r − µ_e)^T Σ_r^{-1} (µ_r − µ_e) − \log \frac{det(Σ_e)}{det(Σ_r)} − d \Big\} .   (28)

Besides the asymmetric KL divergence, the authors propose a symmetric variant which uses the expected likelihood:

f (h, r, t) = \log D_EL(P_e, P_r)
           = \frac{1}{2} \Big\{ (µ_e − µ_r)^T (Σ_e + Σ_r)^{-1} (µ_e − µ_r) + \log det(Σ_e + Σ_r) + k_e \log(2π) \Big\} .   (29)

TABLE 1
Investigated interaction models [27] and their required number of parameters. k corresponds to the number of neurons in the hidden layer, nf to the number of convolutional kernels, kr and kc to the height and width of the convolutional kernels.

Model      | Parameters
ComplEx^a  | |E|2d + |R|2d
ConvE^b    | |E|d + |R|d + d + nf kr kc + 2 + 2nf + 2d + (h − kr + 1)(w − kc + 1)nf d + |E|
ConvKB     | |E|d + |R|d + nf (d + 4) + 1
DistMult   | |E|d + |R|d
ER-MLP     | |E|d + |R|d + k(3d + 2) + 1
HolE       | |E|d + |R|d
KG2E       | 2|E|d + 2|R|d
NTN        | |E|d + |R|k(d² + 2d + 2)
ProjE      | |E|d + |R|d + 3d + 1
RESCAL     | |E|d + |R|d²
RotatE^a   | |E|2d + |R|2d
SE         | |E|d + 2|R|d²
SimplE     | 2|E|d + 2|R|d
TransE     | |E|d + |R|d
TransH     | |E|d + 2|R|d
TransR     | |E|de + |R|dr + de dr
UM         | |E|d
TuckER     | |E|de + |R|dr + de² dr + 4de

^a 2d, because of complex valued vectors, i.e. imaginary and real part of a number.
^b w and h correspond to the height and width of the reshaped input.

3.1.0.16 ERMLP: ERMLP [29] is a multi-layer per-
ceptron based approach that uses a single hidden layer and
represents entities and relations as vectors. In the input-
layer, for each triple the embeddings of head, relation, and embedded triple. Each ωi is applied on every row of A
tail are concatenated and passed to the hidden layer. The creating a feature map vi = [vi,1 , ..., vi,d ] ∈ Rd :
output-layer consists of a single neuron that computes the
plausibility score of the triple: vi = g(ωj A + b) , (32)

f (h, r, t) = wT g(W[h; r; t]), (30) where b ∈ R denotes a bias term and g an activation
function which is employed element-wise. Based on the
where W ∈ Rk×3d represents the weight matrix of the resulting feature maps v1 , . . . , vτ , the plausibility score of
hidden layer, w ∈ Rk , the weights of the output layer, a triple is given by:
and g denotes an activation function such as the hyperbolic
tangent. f (h, r, t) = [vi ; . . . ; vτ ] · w , (33)
3.1.0.17 Neural Tensor Network: The Neural Ten-
sor Network (NTN) [30] uses a bilinear tensor layer instead where [vi ; . . . ; vτ ] ∈ Rτ d×1 and w ∈ Rτ d×1 is a shared
of a standard linear neural network layer: weight vector. ConvKB may be seen as a restriction of ER-
MLP with a certain weight sharing pattern in the first layer.
f (h, r, t) = uTr · tanh(hWr t + Vr [h; t] + br ) , (31) 3.1.0.19 ConvE: ConvE [32] is a CNN-based ap-
proach. For each triple (h, r, t), the input to ConvE is a
where Wr ∈ Rd×d×k is the relation specific tensor, and the matrix A ∈ R2×d where the first row of A represents h ∈ Rd
weight matrix Vr ∈ Rk×2d , the bias vector br , and the and the second row represents r ∈ Rd . A is reshaped to a
weight vector ur ∈ Rk are the standard parameters of a matrix B ∈ Rm×n where the first m/2 half rows represent
neural network, which are also relation specific. The result h and the remaining m/2 half rows represent r. In the
of the tensor product hWr t is a vector x ∈ Rk where each convolution layer, a set of 2-dimensional convolutional filters
entry xi is computed based on the slice i of the tensor Wr : Ω = {ωi | ωi ∈ Rr×c } are applied on B that capture
xi = hWir t [30]. As indicated by the interaction model, NTN interactions between h and r. The resulting feature maps
defines for each relation a separate neural network which are reshaped and concatenated in order to create a feature
makes the model very expressive, but at the same time vector v ∈ R|Ω|rc . In the next step, v is mapped into the
computationally expensive. entity space using a linear transformation W ∈ R|Ω|rc×d ,
3.1.0.18 ConvKB: ConvKB [31] uses a convolu- that is eh,r = vT W. The score for the triple (h, r, t) ∈ K is
tional neural network (CNN) whose feature maps capture then given by:
global interactions of the input. Each triple (h, r, t) ∈ K f (h, r, t) = eh,r t . (34)
is represented as a input matrix A = [h; r; t] ∈ Rd×3 in
which the columns represent the embeddings for h, r and Since the interaction model can be decomposed into
t. In the convolution layer, a set of convolutional filters f (h, r, t) = hf 0 (h, r), ti, the model is particularly designed
ωi ∈ R1×3 , i = 1, . . . , τ, are applied on the input in order to 1-N scoring, i.e. efficient computation of scores for (h, r, t)
to compute for each dimension global interactions of the for fixed h, r and many different t.
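The decomposition f(h, r, t) = ⟨f′(h, r), t⟩ is what makes 1-N scoring efficient: f′(h, r) is computed once and scored against every entity with a single matrix product. The following NumPy sketch illustrates the pattern with a stand-in combination function in place of ConvE's actual convolution; all names, shapes, and values are illustrative assumptions rather than the model's real architecture.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
num_entities, d = 1000, 64
E = rng.normal(size=(num_entities, d))       # all entity embeddings
R = rng.normal(size=(50, d))                 # relation embeddings
W = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)

def combine(h_vec, r_vec):
    """Stand-in for f'(h, r); a real ConvE would use 2D convolutions here."""
    return np.tanh(np.concatenate([h_vec, r_vec]) @ W)

def score_one(h_idx, r_idx, t_idx):
    """1:1 scoring of a single triple: f(h, r, t) = <f'(h, r), t>."""
    return combine(E[h_idx], R[r_idx]) @ E[t_idx]

def score_all_tails(h_idx, r_idx):
    """1-N scoring: reuse f'(h, r) and score every entity as tail with one matmul."""
    return E @ combine(E[h_idx], R[r_idx])

scores = score_all_tails(h_idx=3, r_idx=7)
assert np.isclose(scores[42], score_one(3, 7, 42))
print(scores.shape)  # (1000,)
```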

N = \bigcup_{(h,r,t) ∈ K} N(h, r, t)   (35)
N(h, r, t) = T(h, r) ∪ H(r, t)   (36)
T(h, r) = {(h, r, t′) | t′ ∈ E ∧ t′ ≠ t}   (37)
H(r, t) = {(h′, r, t) | h′ ∈ E ∧ h′ ≠ h} .   (38)
Theoretically, we would need to exclude all positive triples from this set of candidates for negative triples, i.e.,

N− = N \ K. In practice, however, since usually |N| ≫ |K|,
the likelihood of generating a false negative is rather low.
Therefore, the additional filter step is often omitted to lower
computational cost. It should be taken into account that a
corrupted triple that is not part of the KG can represent a
true fact.
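A minimal sketch of the corruption procedure discussed here, including the optional filter step that re-samples corrupted triples which are themselves contained in the KG. It is written against a toy KG and is not the sampler implementation used in the benchmarking framework.

```python
import random

# Toy KG as a set of (head, relation, tail) triples.
K = {
    ("Sarah", "works_at", "Deutsche_Bank"),
    ("Peter", "works_at", "Deutsche_Bank"),
    ("Anna", "works_at", "DHL"),
}
entities = sorted({e for h, _, t in K for e in (h, t)})

def corrupt(triple, filtered=False, max_tries=100):
    """Uniformly corrupt either the head or the tail of a positive triple.

    With filtered=True, re-sample if the corrupted triple is itself contained
    in the KG (the optional filter step discussed above)."""
    h, r, t = triple
    for _ in range(max_tries):
        if random.random() < 0.5:
            candidate = (random.choice(entities), r, t)   # corrupt the head
        else:
            candidate = (h, r, random.choice(entities))   # corrupt the tail
        if not filtered or candidate not in K:
            return candidate
    return candidate  # give up after max_tries; may be a false negative

random.seed(0)
print(corrupt(("Sarah", "works_at", "Deutsche_Bank"), filtered=True))
```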
Fig. 3. Visualization of different training approaches for the relation
works_at in the KG in Figure 1. Red color indicates positive examples, UNS and BNS differ in the way they define sample
i.e. true triples present in the KG. Dark blue color denotes triples used weights for (h0 , r, t) or (h, r, t0 ):
as negative examples in LCWA. Light blue color sampling candidates for 3.2.2.1 Uniform negative sampling: With uniform
negative examples in sLCWA. Yellow color indicates triples that are not
considered.
negative sampling (UNS) [16], the first step is to randomly
(uniformly) determine whether h or t shall be corrupted for
a positive triple (h, r, t) ∈ K. Afterwards, an entity e ∈ E is
3.2 Training Approaches uniformly sampled and selected as the corrupted head/tail
Because most KGs are generated under the open world entity.
assumption (OWA), we require training approaches involv- 3.2.2.2 Bernoulli negative sampling: With
ing techniques such as negative sampling to avoid over- Bernoulli negative sampling (BNS) [17], the probability
generalization to true facts. Here, we describe two common of corrupting h or t in (h, r, t) ∈ K is determined by the
training approaches found in the literature: the local closed property of the relation r: if the relation is a one-to-many
world assumption (LCWA) and the stochastic local closed relation (e.g. motherOf ), BNS assigns a higher probability
world assumption (sLCWA). It should be noted that the to replace h, and if it is a many-to-one relation (e.g. bornIn)
LCWA and the sLCWA do not affect the evaluation. it assigns a higher probability to replace t. More precisely,
for each relation r ∈ R the average number of tails per
3.2.1 Local closed world assumption head (tph) and heads per tail (hpt) are first computed. These
The LCWA was introduced by [29] and used in subsequent statistics are then used to define a Bernoulli distribution
tph
works as an approach to generate negative examples during with parameter tph+hpt . For a triple (h, r, t) ∈ K the head
training [32], [24]. In this setting, for any triple (h, r, t) ∈ K is corrupted with probability tph+hpt tph
and the tail with
that has been observed, a set T − (h, r) of negative exam- hpt
probability tph+hpt . The described approach reduces the
ples is created by considering all triples (h, r, ti ) ∈ / K as
chance of creating corrupted triples that represent true
false. Therefore, for our exemplary KG (Figure 1) for the
facts [17].
pair (Peter, works at), the triple (Peter, works at, DHL) is a
false fact since for this pair only the triple (Peter, works at,
Deutsche Bank) is part of the KG. Similarly, we can construct 3.3 Loss Functions
H− (r, t) based on all triples (hi , r, t) ∈
/ K, or R− (h, t) based
The loss function can have a significant influence on the
on the triples (h, ri , t) ∈/ K. Constructing R− (h, t) is a
performance of KGEMs [7]. In the following, we describe
popular choice in visual relation detection domain [33], [34].
pointwise, pairwise, and setwise loss functions that have been
However, most of the works in knowledge graph modeling
frequently be used within KGEMs. For additional discussion
construct only T − (h, r) as the set of negative examples, and
and a slightly different categorization we refer to the work
in the context of this work refer to T − (h, r) as the set of
of Mohamed et al. [7].
negatives examples when speaking about LCWA.

3.2.2 Stochastic local closed world assumption 3.3.1 Pointwise Loss Functions
Under the stochastic local closed world assumption Let f denote the interaction model of a KGEM. With ti ,
(sLCWA), instead of considering all possible triples we denote a triple (i.e. ti ∈ K), and with li ∈ {0, 1} or
(h, r, ti ) ∈
/ K, (hi , r, t) ∈
/ K or (h, ri , t) ∈
/ K as false, we ˆli ∈ {−1, 1} its corresponding label, where 1 corresponds
randomly take samples of these sets. to the label of the positive triples, and 0 / -1 to the label
Two common approaches for generating negative sam- of the negative triples. Pointwise loss functions compute an
ples are uniform negative sampling (UNS) [16] and independent loss term for each triple-label pair, i.e. for a
Bernoulli negative sampling (BNS) [17] in which negative |B|
batch B = {(ti , li )}i=1 , the loss is given as
triples are created by corrupting a positive triple (h, r, t) ∈
K by replacing either h or t. We denote with N the set of all 1 X
L= L(ti , li ) (39)
potential negative triples: |B|
(ti ,li )∈B
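The pointwise losses described next can each be written in a few lines. The NumPy sketch below implements the square error, binary cross entropy, softplus, and pointwise hinge losses for a batch of scores; averaging over the batch corresponds to Eq. (39). It is an illustrative re-implementation, not the framework's (potentially numerically stabilized) versions.

```python
import numpy as np

def square_error(scores, labels01):
    """Eq. (40): 0.5 * (f(t_i) - l_i)^2 with labels in {0, 1}."""
    return 0.5 * (scores - labels01) ** 2

def binary_cross_entropy(scores, labels01, eps=1e-12):
    """Eq. (41): BCE on sigmoid-transformed scores (numerically naive variant)."""
    p = 1.0 / (1.0 + np.exp(-scores))
    return -(labels01 * np.log(p + eps) + (1 - labels01) * np.log(1 - p + eps))

def softplus_loss(scores, labels_pm1):
    """Eq. (42): pointwise logistic/softplus loss with labels in {-1, +1}."""
    return np.log1p(np.exp(-labels_pm1 * scores))

def pointwise_hinge(scores, labels_pm1, margin=1.0):
    """Eq. (43): max(0, lambda - l_i * f(t_i))."""
    return np.maximum(0.0, margin - labels_pm1 * scores)

scores = np.array([2.3, -0.7, 0.1])
labels01 = np.array([1.0, 0.0, 1.0])
labels_pm1 = 2 * labels01 - 1
for fn in (square_error, binary_cross_entropy):
    print(fn.__name__, fn(scores, labels01).mean())
for fn in (softplus_loss, pointwise_hinge):
    print(fn.__name__, fn(scores, labels_pm1).mean())
```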

In the following, we describe four different pointwise losses: are already predicted well enough (according to the margin
The square error loss, binary cross entropy loss (BCEL), pointwise parameter λ).
hinge loss, and logistic loss.
3.3.1.1 Square Error Loss: The square error loss 3.3.2 Pairwise Loss Functions
function computes the squared difference between the pre- Next, we describe widely applied pairwise loss functions
dicted scores and the labels li ∈ {0, 1} [7]: that are used within KGEMs, namely the pairwise hinge loss
1 and the pairwise logistic loss. They both compare the scores
(f (ti ) − li )2
L(ti , li ) = (40) of a positive triple t+ and a negative triple t− . The negative
2
triple in a pair is usually obtained by corrupting the positive
The squared error loss strongly penalizes predictions that
one. Thus, the pairs often share common head or tail entities
deviate considerably from the labels, and is usually used − |B|
and relations. For a batch of pairs B = {(t+ i , ti )}i=1 , the
for regression problems. For simple models it often permits
loss is given as
more efficient optimization algorithms involving analyti-
cal solutions of sub-problems, e.g. the Alternating Least 1 X
L= L(f (t− +
i ) − f (ti )) . (44)
Squares algorithm used by [20]. |B| + −
(ti ,ti )∈B
3.3.1.2 Binary cross entropy loss: The binary cross
entropy loss is defined as [32]: Hence, the loss function evaluates the difference in scores
L(ti , li ) = − (li · log(σ(f (ti ))) ∆ = f (t− +
i ) − f (ti ) between a positive and a negative triple,
(41) rather than their absolute scores. This is in accordance to
+ (1 − li ) · log(1 − σ(f (ti )))), the OWA assumption, where we do not assume to have
where li ∈ {0, 1} and σ represents the logistic sigmoid func- negative labels, but just ”less positive” ones.
tion. Thus, the problem is framed as a binary classification 3.3.2.1 Pairwise Hinge Loss/Margin ranking loss:
problem of triples, where the model’s outputs are regarded The pairwise hinge loss or margin ranking loss (MRL) is
as logits. The loss is not well-suited for translational distance given by
models because these models produce a negative distance L(∆) = max(0, λ + ∆) . (45)
as score and cannot produce positive model outputs. ConvE
3.3.2.2 Pairwise Logistic Loss: The pairwise logistic
and TuckER were originally trained in a multi-class setting
loss is defined as [7]:
using the binary cross entropy loss where each (h, r)-pair
has been classified against e ∈ E simultaneously, i.e., if L(∆) = log(1 + exp(∆)) . (46)
|E| = n, the label vector for each (h, r)-pair has n entries
indicating whether the triple (h, r, ei ) is (not) part of the Thus, it can be seen as a soft-margin formulation of the
KG, and along each dimension of the label vector a binary pairwise hinge loss with a margin of zero.
classification is performed. It should be noted that there
exist different implementation variants of the binary cross 3.3.3 Setwise Loss Functions
entropy loss that address numerical stability. ConvE and Setwise loss functions neither compare individual scores,
TuckER employed a numerically unstable variant, and in the or pairs of them, but rather more than two triples’ scores.
context of this work, we refer to this variant when referring Here, we describe the self-adversarial negative sampling
to the binary cross entropy loss. loss (NSSAL) and the cross entropy loss (CEL) as exam-
3.3.1.3 Pointwise Logistic Loss/Softplus loss: An ples of such loss functions that has been applied within
alternative, but equivalent formulation of the binary cross KGEMs [22], [7].
entropy loss is the pointwise logistic loss (or Softplus loss 3.3.3.1 Self-adversarial negative sampling loss: In-
(SPL)): stead of treating each negative sample equally in the
L(ti , li ) = log(1 + exp(−ˆli · f (ti )) (42) loss function, the self-adversarial negative sampling loss
(NSSAL) weights negative samples according to the pre-
where ˆ li ∈ {−1, 1} [7]. It has been used to train ComplEx, dicted scores by the interaction model [22]. Let (h, r, t) ∈ K
ConvKB, and SimplE. We consider both variants separately denote a true triple, and {(h0i , r, t0i )}K
i=1 the set of negative
because both have been used in different model imple- samples generated for it. Then, we define the following
mentations, and their implementation details might yield distribution over the negative samples
different results (e.g., to numerical stability).
3.3.1.4 Pointwise Hinge Loss: The pointwise hinge exp(αf (h0i , r, t0i ))
p((h0i , r, t0i )) = Pn 0 0 , (47)
loss sets the score of positive examples larger than a margin j=1 exp(αf (hj , r, tj ))
parameter λ while reducing the scores of negative examples where α ∈ R denotes a temperature parameter. The proba-
to values below −λ: bilities of the negative triples are then used as weights for
L(ti , li ) = max(0, λ − ˆli · f (ti )) (43) them to compute the loss:

where ˆ li ∈ {−1, 1}. The loss penalizes scores of positive L = − log(σ(γ + f (h, r, t)))
examples which are smaller than λ, but does not impose any K
X (48)
restriction on values > λ. Similarly, negative scores larger − p((h0 , r, t0 )) · log(σ(−(γ + f (h0i , r, t0i )))) .
than −λ contribute to the loss, whereas all values smaller i=1

than −λ do not have any loss contribution [7]. Thereby, the Thus, negative samples for which the model predicts a high
model is not encouraged to further optimize triples which score relative to other samples are weighted stronger.
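For comparison, here is an illustrative NumPy sketch of the pairwise hinge/margin ranking loss (Eq. 45), the pairwise logistic loss (Eq. 46), and the self-adversarial negative sampling loss (Eqs. 47–48); gamma denotes the margin and alpha the temperature, and the score values are made up.

```python
import numpy as np

def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Eq. (45): max(0, lambda + f(t^-) - f(t^+)) averaged over score pairs."""
    return np.mean(np.maximum(0.0, margin + neg_scores - pos_scores))

def pairwise_logistic_loss(pos_scores, neg_scores):
    """Eq. (46): soft-margin variant log(1 + exp(f(t^-) - f(t^+)))."""
    return np.mean(np.log1p(np.exp(neg_scores - pos_scores)))

def self_adversarial_loss(pos_score, neg_scores, gamma=9.0, alpha=1.0):
    """Eqs. (47)-(48): negatives weighted by a softmax over their own scores."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    weights = np.exp(alpha * neg_scores)
    weights = weights / weights.sum()                       # p((h'_i, r, t'_i))
    positive_term = -np.log(sigmoid(gamma + pos_score))
    negative_term = -np.sum(weights * np.log(sigmoid(-(gamma + neg_scores))))
    return positive_term + negative_term

pos = np.array([1.5, 0.3])
neg = np.array([0.2, 0.9])
print(margin_ranking_loss(pos, neg), pairwise_logistic_loss(pos, neg))
print(self_adversarial_loss(pos_score=1.5, neg_scores=np.array([0.2, -0.4, 0.9])))
```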

3.3.3.2 Cross entropy loss: The cross entropy loss For each t+ and its corresponding corrupted triples, the
(CEL) has been successfully applied together with 1-N scores are computed and the entities sorted accordingly.
scoring, i.e., predicting for each (h, r)-pair simultaneously a Next, the rank of every t+ among its corrupted triples is
score for each possible tail entity, and framing the problem determined, i.e. the position in the score-sorted list.
as a multi-class classification problem [3], [8]. To apply the Among the corrupted triples in H(r, t) / T (h, r), there
CEL, first, the labels are normalized in order to form a might be true triples that are part of the KG. If these false
proper probability distribution. Second, the predicted scores negatives are ranked higher than the current test triple
for the tail entities of (h, r)-pair are normalized by a soft- t+ , the results might get distorted. Therefore, the filtered
max: evaluation setting has been proposed [16], in which the
exp(f (h, r, t)) corrupted triples are filtered to exclude known true facts
p(t | h, r) = P . (49)
exp(f (h, r, t0 )) from the train and test set. Thus, the rank does not decrease
t0 ∈E
when ranking another true entity higher.
Finally, the cross entropy between the distribution of the Moreover, we want to draw attention to the fact that
normalized scores and the normalized label distribution is the metrics can be further be distorted by unknown false
computed: negatives, i.e., true triples that are contained in the set of
X corrupted triples but are not part of the KG (and therefore
L=− I[(h, r, t0 ) ∈ K] · log(p(t | h, r)) , (50) cannot be filtered out). Therefore, it is essential to investigate
t0 ∈E
the predicted scores of a KGEM and not solely rely on the
where I denotes the indicator function. Note that this loss computed metrics.
differs from the multi-class binary cross entropy as it applies Based upon these individual ranks, the following mea-
a softmax normalization implying that this is a single-label sures are frequently used to summarize the overall perfor-
multi-class problem. mance:
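A compact sketch of the cross entropy loss with 1-N scoring as described above: the predicted tail scores of one (h, r)-pair are normalized with a softmax and compared against a label distribution obtained by normalizing the multi-label targets. This is an illustrative NumPy version, not the exact implementation used in the experiments.

```python
import numpy as np

def cross_entropy_loss(tail_scores, true_tail_indices):
    """Eqs. (49)-(50): softmax over all candidate tails for one (h, r)-pair,
    cross entropy against the normalized multi-label target."""
    # normalize predicted scores with a (numerically stabilized) softmax
    z = tail_scores - tail_scores.max()
    p = np.exp(z) / np.exp(z).sum()
    # normalize the labels so they form a proper probability distribution
    q = np.zeros_like(tail_scores)
    q[true_tail_indices] = 1.0 / len(true_tail_indices)
    return -np.sum(q * np.log(p + 1e-12))

scores = np.array([3.2, -1.0, 0.4, 2.8])   # scores for all 4 candidate tails
print(cross_entropy_loss(scores, true_tail_indices=[0, 3]))
```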
4.0.0.1 Mean rank: The mean rank (MR) represents
3.4 Explicitly Modeling Inverse Relations the average rank of the test triples, i.e.
Inverse relations introduced by [23] and [35] are explicitly 1 X
MR = rank(t) (51)
modeled by extending the set of relations R by a set of |Ktest | t∈K
test
inverse relations rinv ∈ Rinv with Rinv ∩ R = ∅. This
is achieved by training an inverse triple (t, rinv , h) for Smaller values indicate better performance.
each triple (h, r, t) ∈ K. Equipping a KGEM with inverse 4.0.0.2 Adjusted mean rank: Because the interpre-
relations implicitly doubles the relation embedding space of tation of the MR depends on the number of available
any model that has relation embeddings. The goal is to alter candidate triples, comparing MRs across different datasets
the scoring function, such that the task of predicting the (or inclusion of inverse triples) is difficult. This is some-
head entities for (r, t) pairs becomes the task of predicting times further exacerbated in the filtered setting because
tail entities for (t, rinv ) pairs. The explicit training of the the number of candidates varies. Therefore, with fewer
implicitly known inverse relations can lead to better model candidates available, it becomes easier to achieve low ranks.
performance [35] and can for some models increase the The adjusted mean rank (AMR) [10] compensates for this
computational efficiency [32]. problem by comparing the mean rank against the expected
mean rank under a model with random scores:
4 EVALUATION METRICS FOR KGEMS
AMR = \frac{MR}{\frac{1}{|K_{test}|} \sum_{t ∈ K_{test}} \frac{ξ(t) + 1}{2}}   (52)
KGEMs are usually evaluated based on link prediction,
which is on KG defined as predicting the tail/head entities where ξ(t) denotes the number of candidate triples against
for (h, r)/(r, t) pairs. For instance, given queries of the which the true triple t ∈ Ktest is ranked. In the unfiltered
form (Sarah, studied at, ?) or (?, CEO of, Deutsche Bank) the setting we have ξ(t) = |E| − 1 for all t ∈ Ktest . Thereby,
capability of a link predictor to predict the correct entities the measure also adjusts for chance, as a random scoring
that answer the query, i.e. (Sarah, studied at, University of achieves an expected adjusted mean rank of 1. The AMR
Oxford) and (Sarah, CEO of, Deutsche Bank) is measured. has a fixed value range from 0 to 1, where smaller values
However, given the fact that usually true negative ex- (AMR ≪ 1) indicate better performance.
amples are not available, both the training and the test 4.0.0.3 Mean reciprocal rank: The mean reciprocal
set contain only true facts. For this reason, the evaluation rank (MRR) is defined as:
procedure is defined as a ranking task in which the capa-
bility of the model to differentiate corrupted triples from 1 X 1
MRR = (53)
known true triples is assessed [16]. For each test triple |Ktest | t∈K rank(t)
test
t+ = (h, r, t) ∈ Ktest two sets of corrupted triples are
where Ktest is a set of test triples, i.e. the MRR is the
constructed:
mean over reciprocal individual ranks. However, the MRR
1) H(r, t) = {(h0 , r, t) | h0 ∈ E −{h} which contains all is flawed since the reciprocal rank is an ordinal scale and
the triples where the head entity has been corrupted, not an interval scale, i.e. computing the arithmetic mean is
and statistically incorrect [36], [37]. Still, it is often used for early
2) T (h, r) = {(h, r, t0 ) | t0 ∈ E − {t}} that contains all stopping since it is a smooth measure with stronger weight
the triples with corrupted tail entity. on small ranks, and less affected by outlier individual ranks

TABLE 2
Existing Benchmark Datasets.

Dataset    | Triples   | Entities | Relations
FB15K      | 592,213   | 14,951   | 1,345
FB15K-237  | 272,115   | 14,541   | 237
WN18       | 151,442   | 40,943   | 18
WN18RR     | 93,003    | 40,943   | 11
Kinships   | 10,686    | 104      | 26
Nations    | 11,191    | 14       | 56
UMLS       | 893,025   | 135      | 49
YAGO3-10   | 1,079,040 | 123,182  | 37

than the mean rank. The MRR has a fixed value range from 0 to 1, where larger values indicate better performance.
4.0.0.4 Hits@K: Hits@K denotes the ratio of the test triples that have been ranked among the top k triples, i.e.,
Hits@k = \frac{|\{t ∈ K_{test} | rank(t) ≤ k\}|}{|K_{test}|}   (54)
Larger values indicate better performance.
4.0.0.5 Additional Metrics: Further metrics that might be relevant are the area under the Receiver Operating Characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR) [11]. However, these met-
precision-recall curve (AUC-PR) [11]. However, these met-
rics require the number of true positives, false positives, true 5.0.0.6 Nations: The Nations [40] dataset contains
negatives, and false negatives, which in most cases cannot data about countries and their relationships with other
be computed since the KGs are usually incomplete. countries. Exemplary relations are economic aid and accusa-
tion [14].
5 E XISTING B ENCHMARK DATASETS 5.0.0.7 Unified Medical Language System [41]: The
Unified Medical Language System (UMLS) [41] is an ontol-
In this section, we describe the benchmark datasets that
ogy that describes relationships between high-level concepts
have been established to evaluate KGEMs. A summary is
in the biomedical domain. Examples of contained concepts
also given in Table 2.
are Cell, Tissue, and Disease, and exemplary relations are
5.0.0.1 FB15K: Freebase is a large cross-domain KG
part of and exhibits [14], [41].
consisting of around 1.2 billion triples and more than 80
5.0.0.8 YAGO3-10: Yet Another Great Ontology
million entities. Bordes et al. [16] extracted a subset of
(YAGO) [42] is a KG containing facts that have been ex-
Freebase, which is used as a benchmark dataset and named
tracted from Wikipedia and aligned with WordNet in or-
it FB15K. It contains 14,951 entities, 1,345 relations, as well
der to exploit the large amount of information contained
as more than half a million triples describing facts about
in Wikipedia and the taxonomic information included in
movies, actors, awards, sports, and sports teams [32].
WordNet. It contains general facts about public figures,
5.0.0.2 FB15K-237: FB15K has a test-leakage, i.e.
geographical entities, movies, and further entities, and it
a major part of the test triples (∼81%) are inverses of
has a taxonomy for those concepts. YAGO3-10 is a subset of
triples contained in the training set: for most of the test
YAGO3 [43] (which is an extension of YAGO) that contains
triples of the form (h, r, t), there exists a triple (h, r0 , t)
entities associated with at least ten different relations. In
or (t, r0 , h) in the training set. Therefore, Toutanova and
total, YAGO3-10 has 123,182 entities and 37 relations, and
Chen [38] constructed FB15K-237 in which inverse relations
most of the triples describe attributes of persons such as
were removed [38]. FB15K-237 contains 14,541 entities and
citizenship, gender, and profession [32].
237 relations.
5.0.0.3 WN18: WordNet1 is a lexical knowledge
base in which entities represent terms and are called synsets. 6 R EPRODUCIBILITY S TUDIES
Relations in WordNet represent conceptual-semantic and
lexical relationships (e.g. hyponym). Bordes et al. [14] ex- The goal of the reproducibility studies was to investigate
tracted a subset of WordNet named WN18 that is frequently whether it is possible to replicate experiments based on
used to evaluate KGEMs. It contains 40,943 synsets and 18 the information provided in each model’s accompanying
relations. paper. If specific information was missing, such as the
5.0.0.4 WN18RR: Similarly to FB15K, WN18 also number of training epochs, we tried to find this informa-
has a test-leakage (of approximately 94%) [38]. For instance, tion in the accompanying source code if it was accessible.
for most of the test triples of the form (h, hyponym, t), For our study, we focused on the two most frequently
there exists a triple (t, hypernym, o) in the training set. used benchmark datasets, FB15K and WN18, as well as
Dettmers et al. [32] have shown that a simple rule-based their respective subsets FB15K-237 and WN18RR. Table 3
system can obtain results competitive to the state of the art illustrates for which models results were reported (in the
results on WN18. For this reason, they constructed WN18RR accompanying publications) for the considered datasets. A
by removing inverse relations similarly to the procedure checkmark denotes that results were reported, and green
applied to FB15K. WN18RR contains 40,943 entities and 11 background indicates that the entire experimental setup for
relations. the corresponding dataset was described. Results have not
5.0.0.5 Kinships: The Kinships [39] dataset de- been reported for every model for every dataset because
scribes relationships between members of the Australian some of the benchmark datasets were created after the
tribe Alyawarra and consists of 10,686 triples. It contains 104 models were published. Therefore, these models have been
entities representing members of the tribe and 26 relation- excluded from our reproducibility study.
ship types that represent kinship terms such as Adiadya or 6.0.0.1 Experimental Setup: For each KGEM, we
Umbaidya [14]. applied identical training and evaluation settings as de-
scribed in their concomitant papers. We ran each experiment
1. https://wordnet.princeton.edu/ four times with random seeds to measure the variance in
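The datasets in Table 2 are available through PyKEEN's dataset classes. The short sketch below loads a few of them and prints their basic statistics; class and attribute names follow recent PyKEEN releases and may differ in other versions, and the first call downloads the data.

```python
# Inspect benchmark datasets via PyKEEN's dataset classes (names assume a
# recent PyKEEN release and may differ across versions).
from pykeen.datasets import FB15k237, Kinships, WN18RR

for dataset_cls in (Kinships, WN18RR, FB15k237):
    dataset = dataset_cls()
    train = dataset.training  # a TriplesFactory holding the training split
    print(
        dataset_cls.__name__,
        train.num_entities,
        train.num_relations,
        train.num_triples,
    )
```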

TABLE 3 Even within the same framework, bug fixes or optimizations


Denotes for each proposed model whether results have been reported of the framework can lead to different results based on
for FB15K, WN18, or their alterations. Furthermore, it indicates whether
an official implementation exists where P corresponds to a PyTorch the used version. Our benchmarking study highlights that
based implementation, T to a TensorFlow based implementation, and O with adapted settings, results can be reproduced and even
to other implementations. A green background indicates that the full improved.
experimental setup was available. The models highlighted with * where
included in the reproducibility study.
6.2 Unreported Hyper-parameters Impedes Reproduc-
Model Code FB15K FB15K-237 WN18 WN18RR tion
ComplEx* O X X Some experiments did not report the full setup, which
ConvE* P X X X X impedes the reproduction of results. For example, the em-
ConvKB T X X beddings in the ConvKB experiments have been pre-trained
DistMult* - X X
ER-MLP -
based on TransE. However, the batch size for training TransE
HolE* O X X has not been reported, which can have a significant effect on
KG2E* - X X the results, as previously discussed. Furthermore, we could
NTN - not reproduce the results for HolE on FB15K. The apparent
ProjE T X X
RESCAL O reason is that we could not find the hyper-parameter setting
RotatE* P X X X X for FB15K, such that we used the same setting as for WN18
SE O which we found in the accompanying implementation.
SimplE* T, P X X
TransD* - X X
TransE* O X X 6.3 Two Perspectives: Publication versus Implementa-
TransH* - X X tion
TransR* O X X
TuckER* P X X X X While preparing our experiments, we observed that for
UM - some experiments, essential aspects, which are part of the
released source code, have not been discussed in the paper.
For instance, in the publication describing ConvE, it is not
the obtained results. We evaluated the models based on the mentioned that inverse triples have been added to the KGs
ranking metrics MR, AMR, MRR, and Hits@K. As discussed in a pre-processing step. This step seems to be essential
in [4], [10], the exact computation of ranks differs across dif- to reproduce the results. A second example is SimplE, for
ferent codebases, and can lead to significant differences [4]. which the predicted scores have been clamped to the range
We follow the nomenclature of Berrendorf et al. [10], and re- of [−20, 20]. This step was not mentioned in the publication,
port scores based on the optimistic, pessimistic, and realistic but it can have a significant effect when the model is
rank definitions. evaluated based on an optimistic ranking approach, which
Tables 8, 9, 8, and 10 represent the results for FB15K, is the case for SimplE.
FB15K-237, WN18, and WN18RR where experiments high-
lighted in blue represent soft-reproducible experiments, i.e.,
6.4 Lack of Official Implementations Impedes Repro-
could be reproduced by some margin, and experiments
duction
highlighted in orange represent extreme outliers. In the
following, we discuss the observations that we made during During our experiments, we observed that for DistMult and
our experiments. TransD, we were able to reproduce the results on WN18,
but not on FB15K. A reason might be differences in the
implementation details of the frameworks used to train
6.1 Reproductions Requiring Alternate Hyper- and evaluate the models. For example, the initialization
Parameters of the embeddings or the normalization of the loss values
One of the observations we made is that for some ex- could have an impact on the performance. Since there exists
periments, results of similar performance could only be no official implementation (see Table 3) for DistMult and
obtained based on a different set of hyper-parameter values. TransD, it is not possible to check the above-mentioned
For instance, the results for TransE could only be reproduced aspects. Furthermore, we were not able to reproduce the
by adapting the batch size and the number of training results for TransH for which also no official implementation
epochs. We trained TransE on WN18 for 5000 epochs com- is available. There exist reference implementations2 , which
pared to a reported number of 1000 epochs in order to obtain slightly differ from the model initially proposed.
comparable results. Furthermore, for RotatE on FB15K and
WN18, we received better results when adapting the learn- 6.5 Reproducibility is Dependent on The Ranking Ap-
ing rate. The reason for these differences might be explained proach
by the implementation details of the underlying frameworks As discussed in [4], [10], the ranking metrics have been
which have been used to train the models. Authors of implemented differently by various authors. In our experi-
early KGEMs often implemented their training algorithms ments, we report results based on three common implemen-
themselves or used frameworks which were popular at tations of the ranking metrics: i.) average, ii.) optimistic and
the respective time, but are not used anymore. Therefore, iii.) pessimistic ranking (Section 4). If a model predicts the
differences between the former and current frameworks
may require an adaption of the hyper-parameter values. 2. https://github.com/thunlp/OpenKE
ALI et al. 11

same score for many triples, there will be a large discrepancy TABLE 4
between the three ranking approaches. We could observe Hyper-Parameter Ranges for Ablation Experiments
such a discrepancy for SimplE for which the results on
FB15K (Table 11) and WN18 (Table 13) were almost 0% Hyper-Parameter Range
based on the average ranking approach, but were much Embedding-Dimension {64,128,256}
higher based on the optimistic ranking approach. Similar Initialization {Xavier}
Optimizersa {Adam, Adadelta}

Shared
observations for other KGEM have been made in [4].
Learning Rate (log scale) [0.001, 0.1)
Batch Sizeb {128, 256, 512}
Model inverse relations {Yes, No}
7 B ENCHMARKING Epochs 1,000
In our benchmarking studies, we evaluated a large set Loss {BCEL, MRL, NSSAL, SPL}

sLCWA
of different combinations of interaction models, training Margin for MRL {0.5, 1.5, ... , 9.5}
approaches, loss functions, and the effect of explicitly mod- Margin for NSSAL {1, 3, 5, ... , 29}
ADVT for NSSAL {0.1, 0.2, ... , 1.0}
eling inverse relations. In particular, we investigated 19 Number of Negativesc {1, 2, ... , 100}
interaction models, two training approaches, and five loss

LCWA
Loss {BCEL, CEL, SPL}
functions on four datasets. We refer to a specific combination Label Smoothing (log scale) [0.001, 1.0)
of interaction model, training approach, loss function, and
whether inverse relations are explicitly modeled as a configu- a
For Kinships, we evaluated Adam and Adadelta, and for the
ration, e.g., RotatE + LCWA + SPL + inverse relations. We do remaining datasets we sticked to Adam since it performed
not refer to different hyper-parameter values such as batch almost in every experiment at least equally good as Adadelta
and in many experiments significantly better.
size or learning rate when we use the term configuration. b
For YAGO3-10, the batch-size has been sampled from the set
For each configuration, we used random search to perform {1024, 2048, 2096, 8192}.
c
the hyper-parameter optimizations over all other hyper- For YAGO3-10, the number of negative triples per each each
parameters and applied early stopping on the validation set. positive has been sampled from the set {1, 2, ..., 50}.
Each hyper-parameter optimization experiment lasted for a
maximum of 24 hours or 100 iterations, in which new hyper-
parameters have been sampled in each iteration. Overall, extent of our benchmarking study and the fact that YAGO3-
we performed individual hyper-parameter optimizations 10 contains more than 132,000 entities, which makes the
for more than 1,000 configurations. We retrain the model training based on the LCWA with 1-n scoring expensive, we
with the best hyper-parameter setting and report evaluation restricted the training approach to the sLCWA for YAGO3-
results on the test set. 10.
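To make the notion of a configuration concrete, the following hedged sketch runs a single combination of interaction model, training approach, loss function, and inverse-relation modeling through PyKEEN's pipeline. Argument names follow recent PyKEEN releases and may differ across versions; the hyper-parameter values are illustrative, not the tuned values from this study.

```python
# A hedged sketch of one "configuration" (interaction model + training approach
# + loss + inverse relations) in PyKEEN; values below are illustrative only.
from pykeen.pipeline import pipeline

result = pipeline(
    dataset="fb15k237",
    dataset_kwargs=dict(create_inverse_triples=True),  # explicitly model inverse relations
    model="RotatE",
    model_kwargs=dict(embedding_dim=128),
    training_loop="sLCWA",        # stochastic local closed world assumption
    negative_sampler="basic",     # uniform negative sampling
    loss="NSSA",                  # self-adversarial negative sampling loss
    training_kwargs=dict(num_epochs=1000, batch_size=256),
    stopper="early",              # early stopping on the validation set
    random_seed=42,
)
result.save_to_directory("rotate_fb15k237_slcwa_nssa")
```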
Before presenting our results, we provide an overview of 7.0.0.4 Loss Functions: We investigated MRL,
the experimental setup, comprising the investigated inter- BCEL, SPL, NSSAL, and CEL since they represent the vari-
action models, training approaches, loss functions, negative ety of types described in Section 3.3 and because they have
samplers, and datasets. We used the sLCWA and LCWA been previously shown to yield good results. MRL has not
as training approaches. For the sLCWA we applied a 1:k- been historically used in the 1-N scoring setting likely due
Scoring as usually done throughout the literature [16], [21], to the fact that in 1-N scoring, the number of positive and
where k denotes the number of negative examples for each negative scores in each batch is not known in advance and
positive. For the LCWA, we applied a 1:N-Scoring, i.e., dynamic. Thus, the number of possible pairs varies as well
we sample each batch against all negatives examples as ranging from N − 1 to (N/2)2 for each (h, r) combination.
typically done for training with the LCWA [32]. Table 4 The accompanying variance in memory requirements for
shows the hyper-parameter ranges for the sLCWA and the each batch thus poses practical challenges. Therefore, we
LCWA assumptions. did not use the MRL in combination with the 1-N scoring
7.0.0.1 Datasets: We performed experiments on the setting.
following four datasets: WN18RR, FB15K-237, Kinships and 7.0.0.5 Negative Sampler: When using the sLCWA,
YAGO3-10. We selected WN18RR and FB15K-237 since we generated negative samples with UNS. When training
they are widely applied benchmarking datasets. We chose with the LCWA and 1-N scoring, no explicit negative sam-
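The point about memory can be illustrated with a small sketch. The BCEL and MRL formulations below are generic PyTorch versions written for this illustration, not necessarily the exact implementations used in our framework: BCEL stays element-wise over the N scores of one (h, r) row, whereas a margin-based loss needs every positive-negative pair, so its memory footprint depends on how many positives the row happens to contain.

```python
import torch
import torch.nn.functional as F

N = 8                                      # number of candidate entities
scores = torch.randn(N)                    # scores of one (h, r) row under 1:N-scoring
labels = torch.zeros(N); labels[:3] = 1    # here, 3 of the N tails happen to be true

# BCEL: element-wise, always exactly N terms regardless of the number of positives.
bcel = F.binary_cross_entropy_with_logits(scores, labels)

# MRL: needs every (positive, negative) pair, i.e. p * (N - p) terms, which ranges
# from N - 1 (p = 1) to (N / 2)^2 (p = N / 2), so the footprint is batch-dependent.
pos, neg = scores[labels == 1], scores[labels == 0]
margin = 1.0
pairs = pos.unsqueeze(1) - neg.unsqueeze(0)          # shape (p, N - p)
mrl = torch.clamp(margin - pairs, min=0).mean()
```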
7.0.0.5 Negative Sampler: When using the sLCWA, we generated negative samples with UNS. When training with the LCWA and 1:N-scoring, no explicit negative sampling was required.
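The sketch below shows a simplified form of uniform negative sampling (UNS) as used with the sLCWA: each positive triple is corrupted by replacing either its head or its tail with an entity drawn uniformly at random. The function is our own simplification; in particular, it does not filter out corruptions that accidentally correspond to true triples.

```python
import torch

def uniform_negative_samples(pos: torch.Tensor, num_entities: int, k: int) -> torch.Tensor:
    """Corrupt each positive (h, r, t) into k negatives by replacing head or tail uniformly."""
    neg = pos.repeat_interleave(k, dim=0).clone()           # (batch * k, 3)
    replace_head = torch.rand(len(neg)) < 0.5               # choose which side to corrupt
    random_entities = torch.randint(num_entities, (len(neg),))
    neg[replace_head, 0] = random_entities[replace_head]    # corrupted heads
    neg[~replace_head, 2] = random_entities[~replace_head]  # corrupted tails
    return neg

positives = torch.tensor([[0, 1, 2], [3, 0, 4]])            # toy (h, r, t) index triples
negatives = uniform_negative_samples(positives, num_entities=10, k=3)
```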
7.0.0.6 Early Stopping: We evaluated each model every 50 epochs and performed early stopping with a patience of 100 epochs on all datasets except for YAGO3-10. There, considering the larger number of triples seen in each epoch, we evaluated each model every 10 epochs and performed early stopping with a patience of 50 epochs.
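A minimal sketch of this early-stopping criterion, monitoring validation hits@10 computed from the ranks of the true entities, follows. It assumes caller-supplied training and evaluation callbacks and uses our own function names; it is not the PyKEEN implementation.

```python
def hits_at_k(ranks: list[int], k: int = 10) -> float:
    """Fraction of evaluation triples whose true entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def train_with_early_stopping(train_for, evaluate, frequency=50, patience=100):
    """Evaluate every `frequency` epochs; stop once hits@10 has not improved for `patience` epochs."""
    best, best_epoch = -1.0, 0
    for epoch in range(frequency, 1001, frequency):   # at most 1,000 epochs (cf. Table 4)
        train_for(frequency)                           # caller-supplied training step
        score = evaluate()                             # caller-supplied validation hits@10
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best
```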
Below, we describe the results of our benchmarking study. In the four following subsections, we summarize the results for each dataset (i.e., Kinships, WN18RR, FB15K-237, YAGO3-10) along with a discussion of the effect of the models' individual components (i.e., training approaches, loss functions, the explicit modeling of inverse relations) and optimizers on the performance. Finally, we compare model complexity versus performance. In the appendix, we provide further results. In particular, we provide for each model the results of all tested combinations of interaction model, training approach, and loss function.
7.1 Results on the Kinships Dataset

Investigating the model performances on Kinships is interesting because it is a comparatively small KG and thus permits a large number of HPO iterations for each configuration and all interaction models. Figure 4 provides a general overview of the results, i.e., the performance of the interaction models, loss functions, and training approaches, the effect of modeling inverse relations, and the effect of the optimizers. Overall, several interaction models performed very well (hits@10 higher than 90%), some performed poorly (less than 80% hits@10), and the remaining interaction models revealed average performance (80%-90% hits@10; see Figure 4 as well as Figure 28 and Figure 27 in the appendix).

The key observations are that the interaction models TransD, TransH, TransR, and UM performed poorly (less than 80%). UM significantly under-performed, which is not surprising since it does not consider the multi-relational information of the data. For a dataset with structural properties similar to Kinships, the interaction models ComplEx and SimplE seem to be excellent choices since they do not have a very high model complexity, obtain good performance, and are robust throughout a large set of different configurations. In the following, we examine the effect of the single components that define a KGEM.

7.1.0.1 Impact of the Optimizer: On Kinships, we evaluated Adam and Adadelta as optimizers. Our results highlight that in most cases, Adam outperforms Adadelta (in many cases by a high margin) or at least performs equally well (Figure 4). One of the few interaction models that performed very well across all configurations trained with Adadelta is ConvE (Figure 27 in the appendix).

Since Adam performed better for the majority of the configurations and there was always at least one Adam-based configuration that performed as well as the best Adadelta configuration, we decided to proceed only with Adam for the remaining datasets in order to reduce the computational costs.

7.1.0.2 Impact of the Training Approach: Figure 5 and Figure 6 depict the effect of the training approaches. The former summarizes the results over the interaction models, and the latter differentiates between them. At this point, we focus only on the BCEL and the SPL since they have been trained with both training approaches. The key observation is that both training approaches perform comparably well for the BCEL and SPL functions. However, BCEL reveals slightly less variance when trained in combination with the sLCWA, and the SPL varies less when trained based on the LCWA.

7.1.0.3 Impact of the Loss Function: Figure 5 highlights that all five loss functions achieve high performance, but at the same time, most of them exhibit high variance. An exception is the MRL, which is comparatively robust. Figure 32 (Appendix) reveals that some interaction models (e.g., ComplEx, ConvKB) obtain high results across all loss functions, and others such as ConvE and KG2E suffer for some of the loss functions.

7.1.0.4 Impact of Explicitly Modeling Inverse Relations: Figures 7-9 present the effect of explicitly modeling inverse relations. It can be observed that, in general, the LCWA benefits from the explicit usage of inverse relations in terms of robustness, but sLCWA-based configurations can lead to similar results.

The loss functions that are trained under the LCWA are more robust when trained with inverse relations. Loss functions that have been trained based on the sLCWA exhibit, in all cases except for NSSAL, a lower variance when inverse relations are not explicitly modeled (Figure 8). This observation highlights that the modeling of inverse relations is closely related to the applied training approach.

Considering the interaction models (Figure 9), it can be seen that for some combinations of interaction models and loss functions, the impact of explicitly modeling inverse relations is significant. For instance, when trained based on CEL, ERMLP performs significantly better when trained with explicit inverse relations, whereas ConvE suffers from explicit inverse relations when trained with the MR loss. Here, we want to highlight the close connection to the training approaches. Therefore, it is important to consider these results in the context of the results depicted in Figures 5-7.

7.1.0.5 Model Complexity versus Performance: Figure 26 (Appendix) plots the model size against the obtained performance. The results highlight that there is no strong correlation between model size and performance, i.e., models with a small number of parameters can perform as well as large models on the Kinships dataset. The skyline comprises small UM models, some intermediate HolE and ProjE models, and larger RotatE and TuckER models. A full list is provided in Table 17 (in the appendix).
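The skylines reported here and in Tables 16-19 are the Pareto fronts over model size and hits@10. The sketch below shows how such Pareto-optimal configurations can be extracted; the function name is ours, and the example values are loosely rounded from Table 17 rather than exact entries.

```python
def pareto_front(configs: list[dict]) -> list[dict]:
    """Keep configurations for which no other one is both at most as large and strictly better."""
    front = []
    for c in configs:
        dominated = any(
            (o["bytes"] <= c["bytes"] and o["hits@10"] > c["hits@10"])
            or (o["bytes"] < c["bytes"] and o["hits@10"] >= c["hits@10"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["bytes"])

configs = [  # illustrative values, roughly rounded from Table 17
    {"model": "UM", "bytes": 26_000, "hits@10": 0.103},
    {"model": "HolE", "bytes": 32_000, "hits@10": 0.874},
    {"model": "RotatE", "bytes": 258_000, "hits@10": 0.987},
]
print(pareto_front(configs))
```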
7.2 Results on the WN18RR Dataset

Figure 10 depicts the overall results over WN18RR. A detailed overview of all configurations can be found in Figure 29 in the appendix. The results highlight that there are several combinations of interaction models, loss functions, and training approaches that obtain hits@10 results that are competitive with state-of-the-art results³. In particular, ComplEx (53.49%), ConvE (56.10% compared to 52.00% in the original paper [32]), DistMult (52.96%), ProjE (51.74%), TransE (56.98%), RESCAL (53.64%), RotatE (60.09% compared to 56.61% in the original paper [22]), SimplE (50.92%), and TuckER (54.26% compared to 52.6% in the original paper [24]) obtained high performance. Especially the result obtained by TransE is impressive, since it beats most of the published state-of-the-art results. The results highlight that determining an appropriate configuration is fundamental, since many interaction models such as ConvE and TransE reveal a high variance across different configurations.

3. https://paperswithcode.com/sota/link-prediction-on-wn18rr

Besides, the results show that with an appropriate configuration, the model complexity can be significantly reduced. For instance, for RotatE, several high-performing configurations have been found (Figure 29 in the Appendix), and the second-best configuration achieved a hits@10 value of 59.34% (compared to 56.61% in the original paper [22]) while being trained with an embedding dimension of 64 (in the complex space). This is especially interesting considering that RotatE was originally trained with an embedding dimension of 500 (using the sLCWA as training approach and the NSSAL as loss function), achieving a hits@10 of 57.1%⁴. By changing the training approach from sLCWA to LCWA and the loss function from NSSAL to SPL, the embedding dimension could be reduced significantly while at the same time improving the hits@10 score.

4. https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding
Fig. 4. Overall hits@10 results for Kinships where box-plots summarize the best results across different configurations, i.e., combinations of interaction models, training approaches, loss functions, and the explicit usage of inverse relations.

Fig. 5. Impact of the training approach on the performance for a fixed loss function for the Kinships dataset based on Adam.

Fig. 6. Impact of training approach on the performance for a fixed interaction model and loss function for the Kinships dataset based on Adam.

Fig. 7. Impact of explicitly modeling inverse relations on the performance for a fixed training approach for the Kinships dataset based on Adam.

Fig. 8. Impact of explicitly modeling inverse relations on the loss function based on Adam for Kinships.
Fig. 9. Impact of explicitly modeling inverse relations on the performance for a fixed interaction model and loss function for the Kinships dataset based on Adam.

A further interesting observation is the performance of UM, which does not model relations but can still compete with some of the other interaction models on WN18RR. This observation might indicate that the relational patterns in WN18RR are not too diverse across relations.

Finally, Figure 10 reveals that we can achieve good performance with all loss functions, where BCEL and SPL obtain the best results. However, the loss functions that have been trained in combination with the LCWA express higher variance.

7.2.0.1 Impact of the Training Approach: Figure 11 depicts the impact of the training approach for a fixed loss function. For now, we focus only on BCEL and SPL since they have been trained under both the sLCWA and LCWA. The figure highlights that for both loss functions, the LCWA achieves higher maximum performance, but at the same time, it reveals a larger variance for both loss functions. Consequently, it may be more difficult to find configurations that obtain high performance.

Figure 12 shows the impact of the training approaches for fixed interaction models and loss functions. The results indicate that for some combinations of interaction models and loss functions, the choice of the training approach has a significant impact on the results. For example, RotatE and ConvE show stronger performance when trained with the LCWA, whereas TransE and TransH suffer under the LCWA.

7.2.0.2 Impact of the Loss Function: Figures 10 and 11 depict the performance of the different loss functions. State-of-the-art results for WN18RR are currently between 50% and 60%, and for each loss function, the 50% mark could be crossed. The results highlight that there is a trade-off between highest performance and robustness, i.e., SPL and BCEL achieve the highest performance (when trained under the LCWA) but also have high variance (especially BCEL + LCWA).

Figure 33 (Appendix) reveals that some interaction models do not perform well with specific loss functions. For instance, ConvE suffers from the MRL and NSSAL, whereas NTN, SE, and TransR suffer from the CEL. However, it should be taken into account that NTN or SE combined with the CEL is a computationally expensive configuration, making extensive HPOs infeasible.

7.2.0.3 Impact of Explicitly Modeling Inverse Relations: Figures 13 and 14 illustrate that the explicit modeling of inverse relations benefits the LCWA. Interestingly, only few sLCWA-based configurations achieve competitive results when inverse relations are used. In this context, the behavior of the NSSAL is notable. When trained together with inverse relations, it expresses low variance and is comparably robust. However, there are only a few high-performing configurations that involve inverse relations. For these configurations, it outperforms all other loss functions trained together with the sLCWA.

7.2.0.4 Model Complexity vs. Performance: Figure 26 (Appendix) highlights that there is no significant correlation between model size and performance. As previously mentioned, we determined a RotatE-based configuration that uses an embedding dimension of only 64 and is competitive with state-of-the-art results that typically use much higher embedding dimensions. The skyline comprises small ConvKB models, an intermediate ConvE model, and large RotatE models. A full list is provided in Table 18 (in the appendix).

7.3 Results on the FB15K-237 Dataset

Figure 16 provides an overall overview of the results obtained on FB15K-237. For the results of each individual configuration, we refer to Figure 30 in the appendix. The results illustrate that the best results of several interaction models such as DistMult, TransE, RESCAL, and TuckER are comparable, whereas RotatE slightly outperforms the other interaction models. Furthermore, the results show that interaction models such as ProjE, SimplE, and UM that have been competitive on WN18RR are less competitive on FB15K-237 (at least in our experimental setup). Moreover, the performances of the five loss functions are close to each other.
Fig. 10. Overall hits@10 results for WN18RR where box-plots summarize the results across different combinations of interaction models, training approaches, loss functions, and the explicit usage of inverse relations.

Fig. 11. Impact of the training approach on the performance for a fixed loss function for the WN18RR dataset.

Fig. 12. Impact of training approach on the performance for a fixed interaction model and loss function for the WN18RR dataset.

Fig. 13. Impact of explicitly modeling inverse relations on the performance for a fixed training approach for the WN18RR dataset.

Fig. 14. Impact of explicitly modeling inverse relations on the performance for a fixed loss function for the WN18RR dataset.

Fig. 15. Impact of explicitly modeling inverse relations on the performance for a fixed interaction model and loss function for the WN18RR dataset.

7.3.0.1 Impact of the Training Approach: Figure 17 shows that for both BCEL and SPL (we focus here only on these two loss functions since they have been trained with both training approaches), the LCWA obtains significantly higher results, but they express a high variance at the same time. Figures 18 and 34 (Appendix) illustrate that some interaction models benefit from one of the training approaches, whereas others show similar performance with both. For instance, it can be observed that RotatE and TransE perform better in combination with the LCWA approach for both loss functions, whereas ERMLP achieves comparable results for both training approaches.

7.3.0.2 Impact of the Loss Function: Figure 16 illustrates that comparable results can be achieved with all loss functions, but they exhibit higher variance when trained together with the LCWA. However, some interaction models seem to be more sensitive to the choice of the loss function. For instance, ComplEx and RESCAL performed better when trained based on the CEL (Figure 34), whereas the translational distance models TransH, TransD, and TransR benefit from the MRL.
7.3.0.3 Impact of Explicitly Modeling Inverse Relations: Figure 19 reveals, as for the previous datasets, that in general the usage of inverse relations is beneficial for training based on the LCWA approach. Focusing on the loss functions (Figure 20), it can be seen that especially the CEL benefits from explicitly modeling inverse relations and that the MRL obtains higher results when trained without inverse relations. Besides, it is remarkable that the BCEL is the only loss function for which training under the LCWA without explicitly modeling inverse relations outperforms the setting in which inverse relations are involved.

Figure 21 depicts the effect of explicitly modeling inverse relations on the combinations of interaction models and loss functions. There are specific combinations of interaction models and loss functions that benefit from the addition of inverse relations. Among those are, for instance, ERMLP, RESCAL, and TuckER when trained based on BCEL. A second example can be seen for models that have been trained based on SPL, for which ComplEx and NTN improve when trained without inverse relations. A very notable observation can be made for CEL: 14 out of 16 interaction models benefit from the addition of inverse relations.

7.3.0.4 Model Complexity vs. Performance: Figure 26 (Appendix) illustrates that for FB15K-237, there is no clear correlation between model size and performance. Tiny models can already obtain similar performance as larger models. The skyline comprises small UM models, some intermediate TransE and DistMult models, and larger RotatE models. A full list is provided in Table 16 (in the appendix).

7.4 Results on the YAGO3-10 Dataset

YAGO3-10 is the largest benchmark dataset in our study. Therefore, it is of interest to investigate how the different interaction models perform on a larger KG. As mentioned in the introduction of this section, we reduced the experimental setup for YAGO3-10 in order to reduce the computational complexity of our entire study. Figure 22 depicts the overall results obtained for YAGO3-10. Detailed results for all configurations are illustrated in Figure 31 in the appendix. The results highlight the previous observation that the performance of many KGEMs heavily depends on the choice of its components and is dataset-specific. For instance, RotatE achieves the best performance but also exhibits the largest variance. TransE, which was among the top-performing interaction models on WN18RR, performed poorly on YAGO3-10. One might conclude that TransE performs better on smaller KGs, but the results obtained on Kinships do not support this assumption. It should be taken into account that some interaction models might benefit from the LCWA, as observed for TransE on WN18RR. Therefore, TransE might perform much better when trained with the LCWA approach. Remarkably, ComplEx seems to be robust for all sLCWA configurations. With regard to the loss functions, all loss functions except MRL obtain comparable results. Though, the MRL is more robust than the other loss functions.

7.4.0.1 Impact of the Loss Function: Figure 23 depicts the impact of the loss functions for fixed interaction models. As observed for the previous datasets, some interaction models perform better with specific loss functions. For instance, RotatE and TransE obtain better performance when trained with MRL and NSSAL, whereas ConvKB and RESCAL benefit from SPL.

7.4.0.2 Impact of Explicitly Modeling Inverse Relations: Figure 24 shows the effect of explicitly modeling inverse relations for fixed loss functions. Three key aspects can be observed. First, BCEL achieves its best performance when trained with inverse relations. However, the median value is much lower compared to the median obtained by the configurations which do not use inverse relations, indicating that it is more difficult to find an appropriate configuration. Second, MRL is the only loss function for which the addition of inverse relations is beneficial for both performance and robustness. Third, the addition of inverse relations has almost no effect on SPL.

Figure 25 illustrates the effect of the usage of inverse relations on the interaction models. The figure reveals that the combinations of some interaction models and loss functions are not impacted by explicitly modeling inverse relations (e.g., ComplEx and ERMLP for all loss functions except MRL). In contrast, others have clear preferences, e.g., ConvKB suffers from the usage of inverse relations for BCEL and NSSAL, and DistMult suffers from inverse relations with BCEL.

7.4.0.3 Model Complexity vs. Performance: Figure 26 (Appendix) shows that there is a low correlation between model size and performance for YAGO3-10. However, the improvement is tiny compared to the differences in model size. It should be taken into account that for KGEMs, the model size usually depends on the number of entities and relations. Therefore, depending on the space complexity of the interaction model (Table 1), the size can grow quickly for large KGs. The skyline comprises small TransE, DistMult, and ConvKB models, some intermediate ERMLP models, and larger ComplEx and RotatE models. A full list is provided in Table 19 (in the appendix).
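The dependence of the model size on the number of entities and relations can be made concrete with a rough estimate. The sketch below is an approximation of our own: it only counts the entity and relation embeddings (one real vector per entity and relation, or two for complex-valued models such as ComplEx and RotatE) and ignores any interaction-specific parameters; the relation count used in the example is a placeholder, not a dataset statistic.

```python
def embedding_bytes(num_entities: int, num_relations: int, dim: int,
                    floats_per_dim: int = 1, bytes_per_float: int = 4) -> int:
    """Rough size of the entity and relation embeddings in bytes.

    floats_per_dim = 1 for real-valued embeddings (e.g., TransE, DistMult),
    2 for complex-valued ones (e.g., ComplEx, RotatE); additional
    interaction-specific parameters are ignored.
    """
    return (num_entities + num_relations) * dim * floats_per_dim * bytes_per_float

# With more than 132,000 entities (YAGO3-10), the entity embeddings dominate:
print(embedding_bytes(132_000, 40, dim=256, floats_per_dim=2) / 2**20, "MiB")
```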
Fig. 16. Overall hits@10 results for FB15K-237 where box-plots summarize the results across different combinations of interaction models, training approaches, loss functions, and the explicit usage of inverse relations.

Fig. 17. Impact of the training approach on the performance for a fixed loss function for the FB15K-237 dataset.

Fig. 18. Impact of training approach on the performance for a fixed interaction model and loss function for the FB15K-237 dataset.

Fig. 19. Impact of explicitly modeling inverse relations on the performance for a fixed training approach for the FB15K-237 dataset.

Fig. 20. Impact of explicitly modeling inverse relations on the performance for a fixed loss function for the FB15K-237 dataset.

Fig. 21. Impact of explicitly modeling inverse relations on the performance for a fixed interaction model and loss function for the FB15K-237 dataset.

8 DISCUSSION & FUTURE WORK

Table 5 illustrates the extent of our studies and Table 6 (Appendix) summarizes the main findings of our work. Although the re-implementation of all machine learning components into a unified, fully configurable framework was a major effort, we believe it is essential to analyze reproducibility and obtain fair benchmarking results. In particular, we were able to address the issue of incompatible evaluation procedures and preprocessing steps in previous publications that are not obvious.

TABLE 5
Evaluation statistics

Metric                   Value
Datasets                 4
Interaction Models       19
Training Approaches      2
Loss Functions           5
Negative Samplers        1
Optimizers               2
Ablation Studies         1,035
Number of Experiments    65,301
Compute Time (hours)     21,246

During our reproducibility study, we found that the reproduction of experiments is a major challenge and, in many cases, not possible with the available information in current publications. In particular, we observed the following four main aspects:
• For a set of experiments, the results can sometimes only be reproduced with a different set of hyper-parameter values.
• For some experiments, the entire experimental setup was not provided, impeding the reproduction of the experiments.
• The lack of an official implementation hampers the reproduction of results.
• Some results depend on the utilized ranking approach (average, optimistic, or pessimistic ranking approach). For example, the optimistic rank may lead to incorrect conclusions about the model's performance.

Our benchmarking study shows that the term KGEM should be used with caution and should be differentiated from the actual interaction model, since our results highlight that the specific combination of interaction model, training approach, loss function, and the usage of explicit inverse relations is often fundamental for the performance.

No configuration performs best across all datasets. Depending on the dataset, several configurations can be found that achieve comparable results (Tables 20-23 and Figures 28-31). Moreover, with an appropriate configuration, the model size can be significantly compressed (see the Pareto-optimal configurations in Tables 16-19), which is of practical relevance when looking for a trade-off between required memory and performance.

The results also highlight that even interaction models such as TransE that have been considered as baselines can outperform state-of-the-art interaction models when trained with an appropriate training approach and loss function. This raises the question of the necessity of the vast number of available interaction models. However, for some interaction models such as RotatE or TuckER, we can observe a good performance across all datasets (note: TuckER has not been evaluated on YAGO3-10). For RotatE, we even obtained the state-of-the-art results on WN18RR (similar results were obtained by Graph Attenuated Attention Networks [44]), and for ConvE and TuckER, we obtained results superior to the originally published ones. ComplEx proved to be a very robust interaction model across different configurations. This can, in particular, be observed from the results obtained on YAGO3-10 (Figure 22).

We discovered that no loss function consistently achieves the best results. Instead, it can be seen that good results can be obtained across all datasets with different loss functions such as the BCEL, NSSAL, and SPL. Remarkably, the MRL is overall the worst-performing loss function. However, one might argue that the MRL is the most compatible loss function with the sLCWA, since it does not assume artificially generated negative examples to be actually false, in contrast to the other loss functions used. The MRL only learns to score positive examples higher than corresponding negative examples, but it does not ensure that a negative example is scored lower than every other positive example. Thus, the absolute score values are not interpretable and cannot be used to compare triples without common head/tail entities. They can only be interpreted relatively, and only when comparing scores for triples with the same (h, r) or (r, t). Although loss functions such as BCEL or SPL treat generated negative triples as true negatives, which may actually contain unknown positive examples, they obtain good performance. This might be explained by the fact that the set of unknown triples is usually dominated by false triples. Therefore, it is likely that a major part of the generated triples is actually negative. Consequently, the KGEM learns to better distinguish positive from negative examples.
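The point that the MRL only interprets scores relative to each other can be verified numerically. The sketch below uses generic PyTorch formulations of the MRL and BCEL written for this illustration: shifting all scores of one query by a constant leaves the MRL unchanged, whereas the BCEL reads the scores as absolute values (logits) and therefore changes.

```python
import torch
import torch.nn.functional as F

pos = torch.tensor([2.0])            # score of a positive triple
neg = torch.tensor([1.0, -0.5])      # scores of its generated negatives

def mrl(pos, neg, margin=1.0):
    return torch.clamp(margin - (pos.unsqueeze(1) - neg.unsqueeze(0)), min=0).mean()

def bcel(pos, neg):
    scores = torch.cat([pos, neg])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(scores, labels)

shift = 10.0                                            # shifting all scores of one query ...
print(mrl(pos, neg), mrl(pos + shift, neg + shift))     # ... leaves the MRL unchanged
print(bcel(pos, neg), bcel(pos + shift, neg + shift))   # ... but changes the BCEL
```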
Considering the explicit usage of inverse relations, we found that their impact can be significant, especially when the interaction model is trained under the LCWA. This might be explained by the fact that, based on LCWA training, the KGEM only learns to perform one-sided predictions (i.e., it learns to either predict head or tail entities), but during the evaluation, it is asked to perform both-sided predictions. Through the inclusion of inverse relations, the model learns to perform both-sided predictions based on one side, i.e., (∗, r, t) can be predicted through (t, r_inverse, ∗). Overall, our results indicate that further investigations on FB15K-237 and YAGO3-10 might lead to results that are competitive with the state of the art.
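A minimal sketch of the augmentation just described follows: every training triple (h, r, t) is complemented by (t, r_inverse, h) over a duplicated relation vocabulary, so that head prediction on the original relation can be answered as tail prediction on its inverse. The function is a schematic illustration of ours, not the exact implementation in our framework.

```python
import torch

def add_inverse_triples(triples: torch.Tensor, num_relations: int) -> torch.Tensor:
    """Append (t, r + num_relations, h) for every (h, r, t); doubles the relation count."""
    h, r, t = triples[:, 0], triples[:, 1], triples[:, 2]
    inverse = torch.stack([t, r + num_relations, h], dim=1)
    return torch.cat([triples, inverse], dim=0)

triples = torch.tensor([[0, 1, 2], [3, 0, 4]])        # toy (h, r, t) index triples
augmented = add_inverse_triples(triples, num_relations=2)
# A head prediction (?, r, t) can now be answered as a tail prediction (t, r_inverse, ?).
```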
Fig. 22. Overall hits@10 results for YAGO3-10 where box-plots summarize the results across different combinations of interaction models, training approaches, loss functions, and the explicit usage of inverse relations. In contrast to the previous datasets, the models have only been trained based on the stochastic local closed world assumption.

Fig. 23. Impact of the loss functions on the performance for fixed interaction models for the YAGO3-10 dataset. Because this dataset was only trained under the sLCWA, the boxes are not split as for the other datasets.

Fig. 24. Impact of explicitly modeling inverse relations on the performance for a fixed loss function for the YAGO3-10 dataset.

Fig. 25. Impact of explicitly modeling inverse relations on the performance for a fixed interaction model and loss function for the YAGO3-10 dataset.

Looking forward, it would be of great interest to re-investigate previously performed studies that analyze the relationship between the performance of KGEMs and the properties of the underlying KGs, to verify that their findings can indeed be attributed to the interaction model alone rather than to the exact configuration, including the loss function, the training approach, and the explicit modeling of inverse relations. Further, the effect of explicitly modeling inverse relations has not been analyzed in depth, in particular how the learned representations of a relation and its inverse are related to each other. Ultimately, we believe our work provides an empirical foundation for such studies and a practical tool to execute them.

ACKNOWLEDGMENT

We want to thank the Center for Information Services and High Performance Computing (ZIH) at TU Dresden for generous allocations of computer time and the Technical University of Denmark for providing us access to their DTU Compute GPU cluster that enabled us to conduct our studies. This work was funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A and Grant No. 01IS18050D (project "MLWin") as well as the Innovation Fund Denmark with the Danish Center for Big Data Analytics driven Innovation (DABAI).

REFERENCES

[1] Q. Wang, Z. Mao, B. Wang, and L. Guo, "Knowledge graph embedding: A survey of approaches and applications," IEEE Trans. Knowl. Data Eng., vol. 29, no. 12, pp. 2724–2743, 2017.
[2] F. Akrami, L. Guo, W. Hu, and C. Li, "Re-evaluating embedding-based knowledge graph completion methods," in CIKM. ACM, 2018, pp. 1779–1782.
[3] R. Kadlec, O. Bajgar, and J. Kleindienst, "Knowledge base completion: Baselines strike back," in Rep4NLP@ACL. Association for Computational Linguistics, 2017, pp. 69–74.
[4] Z. Sun, S. Vashishth, S. Sanyal, P. P. Talukdar, and Y. Yang, "A re-evaluation of knowledge graph completion methods," in ACL. Association for Computational Linguistics, 2020, pp. 5516–5522.
[5] B. Yang, W. Yih, X. He, J. Gao, and L. Deng, "Embedding entities and relations for learning and inference in knowledge bases," in ICLR (Poster), 2015.
[6] F. Akrami, M. S. Saeef, Q. Zhang, W. Hu, and C. Li, "Realistic re-evaluation of knowledge graph completion methods: An experimental study," in SIGMOD Conference. ACM, 2020, pp. 1995–2010.
[7] S. K. Mohamed, V. Novácek, P. Vandenbussche, and E. Muñoz, "Loss functions in knowledge graph embedding models," in DL4KG@ESWC, ser. CEUR Workshop Proceedings, vol. 2377. CEUR-WS.org, 2019, pp. 1–10.
[8] D. Ruffinelli, S. Broscheit, and R. Gemulla, "You CAN teach an old dog new tricks! on training knowledge graph embeddings," in ICLR. OpenReview.net, 2020.
[9] A. Rossi, D. Firmani, A. Matinata, P. Merialdo, and D. Barbosa, "Knowledge graph embedding for link prediction: A comparative analysis," CoRR, vol. abs/2002.00819, 2020.
[10] M. Berrendorf, E. Faerman, L. Vermue, and V. Tresp, "Interpretable and fair comparison of link prediction or entity alignment methods with adjusted mean rank," CoRR, vol. abs/2002.06914, 2020.
[11] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich, "A review of relational machine learning for knowledge graphs," Proc. IEEE, vol. 104, no. 1, pp. 11–33, 2016.
[12] B. Kotnis and V. Nastase, "Analysis of the impact of negative sampling on link prediction in knowledge graphs," CoRR, vol. abs/1708.06816, 2017.
[13] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek, "Amie: association rule mining under incomplete evidence in ontological knowledge bases," in Proceedings of the 22nd international conference on World Wide Web, 2013, pp. 413–422.
[14] A. Bordes, X. Glorot, J. Weston, and Y. Bengio, "A semantic matching energy function for learning with multi-relational data - application to word-sense disambiguation," Mach. Learn., vol. 94, no. 2, pp. 233–259, 2014.
[15] A. Bordes, J. Weston, R. Collobert, and Y. Bengio, "Learning structured embeddings of knowledge bases," in AAAI. AAAI Press, 2011.
[16] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko, "Translating embeddings for modeling multi-relational data," in NIPS, 2013, pp. 2787–2795.
[17] Z. Wang, J. Zhang, J. Feng, and Z. Chen, "Knowledge graph embedding by translating on hyperplanes," in AAAI. AAAI Press, 2014, pp. 1112–1119.
[18] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu, "Learning entity and relation embeddings for knowledge graph completion," in AAAI. AAAI Press, 2015, pp. 2181–2187.
[19] G. Ji, S. He, L. Xu, K. Liu, and J. Zhao, "Knowledge graph embedding via dynamic mapping matrix," in ACL (1). The Association for Computer Linguistics, 2015, pp. 687–696.
[20] M. Nickel, V. Tresp, and H. Kriegel, "A three-way model for collective learning on multi-relational data," in ICML. Omnipress, 2011, pp. 809–816.
[21] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard, "Complex embeddings for simple link prediction," in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 48. JMLR.org, 2016, pp. 2071–2080.
[22] Z. Sun, Z. Deng, J. Nie, and J. Tang, "Rotate: Knowledge graph embedding by relational rotation in complex space," in ICLR (Poster). OpenReview.net, 2019.
[23] S. M. Kazemi and D. Poole, "Simple embedding for link prediction in knowledge graphs," in NeurIPS, 2018, pp. 4289–4300.
[24] I. Balazevic, C. Allen, and T. M. Hospedales, "Tucker: Tensor factorization for knowledge graph completion," in EMNLP/IJCNLP (1). Association for Computational Linguistics, 2019, pp. 5184–5193.
[25] L. R. Tucker et al., "The extension of factor analysis to three-dimensional matrices," Contributions to mathematical psychology, vol. 110119, 1964.
[26] B. Shi and T. Weninger, "Proje: Embedding projection for knowledge graph completion," in AAAI. AAAI Press, 2017, pp. 1236–1242.
[27] M. Nickel, L. Rosasco, and T. A. Poggio, "Holographic embeddings of knowledge graphs," in AAAI. AAAI Press, 2016, pp. 1955–1961.
[28] S. He, K. Liu, G. Ji, and J. Zhao, "Learning to represent knowledge graphs with gaussian embedding," in CIKM. ACM, 2015, pp. 623–632.
[29] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, "Knowledge vault: a web-scale approach to probabilistic knowledge fusion," in KDD. ACM, 2014, pp. 601–610.
[30] R. Socher, D. Chen, C. D. Manning, and A. Y. Ng, "Reasoning with neural tensor networks for knowledge base completion," in NIPS, 2013, pp. 926–934.
[31] D. Q. Nguyen, T. D. Nguyen, D. Q. Nguyen, and D. Phung, "A novel embedding model for knowledge base completion based on convolutional neural network," arXiv preprint arXiv:1712.02121, 2017.
[32] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, "Convolutional 2d knowledge graph embeddings," in AAAI. AAAI Press, 2018, pp. 1811–1818.
[33] H. Zhang, Z. Kyaw, S. Chang, and T. Chua, "Visual translation embedding network for visual relation detection," in CVPR. IEEE Computer Society, 2017, pp. 3107–3115.
[34] S. Sharifzadeh, M. Berrendorf, and V. Tresp, "Improving visual relation detection using depth maps," CoRR, vol. abs/1905.00966, 2019.
[35] T. Lacroix, N. Usunier, and G. Obozinski, "Canonical tensor decomposition for knowledge base completion," in ICML, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 2018, pp. 2869–2878.
[36] N. Fuhr, "Some common mistakes in IR evaluation, and how they can be avoided," SIGIR Forum, vol. 51, no. 3, pp. 32–41, 2017.
[37] S. S. Stevens et al., "On the theory of scales of measurement," 1946.
[38] K. Toutanova and D. Chen, "Observed versus latent features for knowledge base and text inference," in Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, 2015, pp. 57–66.
[39] W. W. Denham, "The detection of patterns in alyawara nonverbal behavior," Ph.D. dissertation, University of Washington, Seattle, 1973.
[40] R. J. Rummel, The dimensionality of nations project: attributes of nations and behavior of nations dyads, 1950-1965. Inter-university Consortium for Political Research, 1976, no. 5409.
[41] A. T. McCray, "An upper-level ontology for the biomedical domain," International Journal of Genomics, vol. 4, no. 1, pp. 80–84, 2003.
[42] T. Rebele, F. M. Suchanek, J. Hoffart, J. Biega, E. Kuzey, and G. Weikum, "YAGO: A multilingual knowledge base from wikipedia, wordnet, and geonames," in International Semantic Web Conference (2), ser. Lecture Notes in Computer Science, vol. 9982, 2016, pp. 177–185.
[43] F. Mahdisoltani, J. Biega, and F. M. Suchanek, "YAGO3: A knowledge base from multilingual wikipedias," in CIDR. www.cidrdb.org, 2015.
[44] R. Wang, B. Li, S. Hu, W. Du, and M. Zhang, "Knowledge graph embedding via graph attenuated attention networks," IEEE Access, vol. 8, pp. 5212–5224, 2020.
APPENDIX

TABLE 6
Summary of main insights over all datasets. Each component (i.e., interaction model, loss function, and training approach) is considered to be
among the top-ten performing configurations when it occurs at least once in the top-ten performing configurations. Note that a single component
is part of several configurations, and therefore, can occur multiple times in the top-ten performing configurations.

Interaction Models
RotatE Among top-ten-performing interaction models across all datasets.
TransE Among top-ten-performing interaction models on WN18RR and FB15K-237.
DistMult Among top-ten-performing interaction models on FB15K-237.
TuckER Among top-ten-performing interaction models for Kinships, and FB15K-237 (has not been evaluated on YAGO3-10).
ComplEx Among top-ten-performing interaction models on YAGO3-10.
ConvE Among top-ten-performing interaction models on WN18RR (has not been evaluated on FB15K-237 and YAGO3-10).
HolE Among top-ten-performing interaction models on YAGO3-10.
ERMLP Among top-ten-performing interaction models on YAGO3-10.
RESCAL Among top-ten-performing interaction models on FB15K-237.
SimplE Among top-ten-performing interaction models on Kinships.
Loss Functions
BCEL Among top-ten-performing loss functions across all datasets.
NSSAL Among top-ten-performing loss functions across all datasets.
SPL Among top-ten-performing loss functions across all datasets.
CEL Among top-ten-performing loss functions on Kinships and FB15K-237 (has not been evaluated on YAGO3-10).
MRL Among top-ten-performing loss functions on Kinships and FB15K-237.
Training Approaches
sLCWA Among top-ten-performing training approaches across all datasets.
LCWA Among top-ten-performing training approaches on Kinships, WN18RR, and FB15K-237 (has not been evaluated on YAGO3-10).
Explicit Modeling of Inverse Relations
Is usually beneficial in combination with the local closed world assumption.
Configurations
Performance The appropriate combination of interaction model, training approach, loss function, and the choice of explicitly modeling inverse relations is crucial for the performance, e.g., TransE can compete with several state-of-the-art interaction models on WN18RR when an appropriate configuration is selected.
There is no single configuration that works best for all datasets.
Variance Some interaction models exhibit a high variance across different configurations, e.g., RotatE on YAGO3-10 (Figure 22)
Pareto-Optimal Tables 16-19 describe Pareto-optimal configurations. It can be seen that there are configurations that require fewer
Configurations parameters while obtaining almost the same performance. In some cases, for the same interaction model, the model can be
significantly compressed.
Reproducibility
Results For FB15K, two out of 12, for WN18, six out of 12, for FB15K-237, one out of three, and for WN18RR, two out of three
experiments can be categorized as soft-reproducible.
Code For four out of 13 models, no official implementation was available.
Parameters For four out of 13 papers, source code was available and the full experimental setup was precisely described.
General Insights
SOTA For WN18RR, we achieve state-of-the-art results in terms of hits@10 through our study with a RotatE configuration (60.09% hits@10, together with Graph Attenuated Attention Networks [44]). Furthermore, we found a TransE configuration that achieves high performance, beating most of the published SOTA results (56.98% hits@10). Based on our results, we recommend further investigating the hyper-parameter space of the most promising configurations for the remaining benchmarking datasets.
Improvements For ConvE (56.10% compared to 52.00% [32]) and TuckER (54.26% compared to 52.6% [24]), we beat the results reported in the original papers on WN18RR due to selecting appropriate configurations and hyper-parameters.
ADDITIONAL RESULTS FROM REPRODUCIBILITY STUDY

TABLE 7
Reproduction of Studies on FB15K without indicating the standard deviation where pub refers to published results, R to results based on the
realistic ranking, O to results based on the optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the
optimistic and pessimistic ranking in case they differ from the realistic ranking. Models highlighted in blue indicate the experiments are
soft-reproducible whereas models highlighted in red indicate the results represent extreme outliers. The full result table with standard deviation
across multiple runs can be found at Table 11.

MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ComplEx pub 69.2 59.9 75.9 84.
R 19.13 10.06 20.82 27.59 38.03 171.16 2.46
ConvE pub 65.7 55.8 72.3 83.1 51.
R 59.56 48.28 66.99 73.27 79.76 50.76 0.73
DistMult pub 35. 57.7
R 26.06 16.45 29.10 35.54 45.00 134.02 1.86
HolE pub 52.4 40.2 61.3 73.9
R 34.15 21.79 39.69 48.06 58.84 193.03 2.71
KG2E pub 71.5 59.
R 0.58 0.11 0.36 0.56 1.01 5779.07 78.40
RotatE pub 79.7 74.6 83. 88.4 40.
R 55.00 41.53 64.14 71.23 78.67 42.28 0.63
SimplE pub 72.7 66. 77.3 83.8
R 0.04 0.01 0.03 0.04 0.06 7395.75 100.02
O 23.90 11.58 24.16 34.73 54.28 139.34
P 0.03 0.01 0.03 0.04 0.06 14652.16
TransD pub 77.3 91.
R 33.99 21.22 40.48 48.57 58.71 153.37 2.29
TransE pub 47.1 125.
R 26.01 15.23 29.85 37.18 47.34 127.92 1.78
TransH pub 64.4 87.
R 2.54 1.69 2.95 3.29 3.74 6320.02 85.63
O 2.54 1.69 2.95 3.29 3.74 6320.00
P 2.54 1.69 2.95 3.29 3.74 6320.05
TransR pub 68.7 77.
R 0.65 0.37 0.63 0.78 1.03 6795.95 91.99
O 0.65 0.37 0.63 0.78 1.03 6795.94
P 0.65 0.37 0.63 0.78 1.03 6795.95
TuckER pub 79.5 74.1 83.3 89.2
R 0.07 0.01 0.02 0.03 0.15 7327.77 99.11

TABLE 8
Reproduction of Studies on FB15K-237 without indicating the standard deviation where pub refers to published results, R to results based on the
realistic ranking, O to results based on the optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the
optimistic and pessimistic ranking in case they differ from the realistic ranking. Models highlighted in blue indicate the experiments are
soft-reproducible whereas models highlighted in red indicate the results represent extreme outliers. The full result table with standard deviation
across multiple runs can be found at Table 12.

MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ConvE pub 32.5 23.7 35.6 50.1 244.
R 26.93 18.22 29.51 35.98 44.95 255.46 3.73
ConvKB pub 39.6 51.7 257.
R 4.71 3.31 4.04 4.57 7.76 4345.27 61.36
RotatE pub 33.8 24.1 37.5 53.3 177.
R 26.42 17.57 28.97 35.29 44.55 191.92 2.84
TABLE 9
Reproduction of Studies on WN18 without indicating the standard deviation where pub refers to published results, R to results based on the
realistic ranking, O to results based on the optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the
optimistic and pessimistic ranking in case they differ from the realistic ranking. Models highlighted in blue indicate the experiments are
soft-reproducible whereas models highlighted in red indicate the results represent extreme outliers. The full result table with standard deviation
across multiple runs can be found at Table 13.

MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ComplEx pub 94.1 93.6 94.5 94.7
R 19.49 12.36 20.66 25.24 32.92 452.67 2.21
ConvE pub 94.3 93.5 94.6 95.6 374.
R 88.81 85.14 91.76 93.29 94.85 444.40 2.17
DistMult pub 83. 94.2
R 77.44 67.45 85.94 89.52 92.72 458.64 2.24
HolE pub 93.8 93. 94.5 94.9
R 70.44 59.29 79.29 84.12 88.61 812.64 3.97
O 70.44 59.29 79.29 84.12 88.61 812.63
P 70.44 59.29 79.29 84.12 88.61 812.64
KG2E pub 92.8 331.
R 3.61 1.35 3.21 4.57 7.02 2708.89 13.25
O 3.61 1.35 3.21 4.57 7.02 2708.88
P 3.61 1.35 3.21 4.57 7.02 2708.89
RotatE pub 94.9 94.4 95.2 95.9 309.
R 87.29 82.17 91.53 93.44 95.28 123.68 0.61
SimplE pub 94.2 93.9 94.4 94.7
R 0.04 0.01 0.03 0.04 0.07 20376.43 99.57
O 38.48 33.93 39.59 42.76 47.01 384.53
P 0.03 0.01 0.03 0.04 0.07 40368.33
TransD pub 92.2 212.
R 36.22 3.94 65.63 79.64 87.27 444.39 2.17
TransE pub 89.2 251.
R 39.19 9.99 64.74 75.44 84.25 468.24 2.29
TransH pub 82.3 388.
R 0.18 0.04 0.19 0.29 0.39 19678.04 96.16
O 0.18 0.04 0.19 0.29 0.39 19678.02
P 0.18 0.04 0.19 0.29 0.39 19678.06
TransR pub 92. 225.
R 0.06 0.00 0.04 0.05 0.11 19686.49 96.20
O 0.06 0.00 0.04 0.05 0.11 19686.49
P 0.06 0.00 0.04 0.05 0.11 19686.50
TuckER pub 95.3 94.9 95.5 95.8
R 0.03 0.00 0.02 0.03 0.04 20622.46 100.78

TABLE 10
Reproduction of Studies on WN18RR without indicating the standard deviation where pub refers to published results, R to results based on the
realistic ranking, O to results based on the optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the
optimistic and pessimistic ranking in case they differ from the realistic ranking. Models highlighted in blue indicate the experiments are
soft-reproducible whereas models highlighted in red indicate the results represent extreme outliers. The full result table with standard deviation
across multiple runs can be found at Table 14.

MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ConvE pub 43. 4. 44. 52. 4187.
R 44.69 40.98 46.49 48.92 51.76 5369.49 26.49
ConvKB pub 24.8 52.5 2554.
R 0.30 0.09 0.21 0.32 0.57 13634.66 67.27
O 0.30 0.09 0.21 0.32 0.57 13634.65
P 0.30 0.09 0.21 0.32 0.57 13634.66
RotatE pub 47.6 42.8 49.2 57.1 3340.
R 48.40 44.02 50.55 52.98 56.51 4263.32 21.03
TABLE 11
Reproduction of Studies on FB15K where pub refers to published results, R to results based on the realistic ranking, O to results based on the
optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the optimistic and pessimistic ranking in case
they differ from the realistic ranking.

MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ComplEx pub 69.2 59.9 75.9 84.
R 19.13 ± 0.45 10.06 ± 0.42 20.82 ± 0.67 27.59 ± 0.94 38.03 ± 1.02 171.16 ± 16.05 2.46 ± 0.28
ConvE pub 65.7 55.8 72.3 83.1 51.
R 59.56 ± 0.06 48.28 ± 0.12 66.99 ± 0.04 73.27 ± 0.03 79.76 ± 0.07 50.76 ± 0.40 0.73 ± 0.01
DistMult pub 35. 57.7
R 26.06 ± 0.17 16.45 ± 0.16 29.10 ± 0.17 35.54 ± 0.21 45.00 ± 0.25 134.02 ± 1.98 1.86 ± 0.03
HolE pub 52.4 40.2 61.3 73.9
R 34.15 ± 0.22 21.79 ± 0.19 39.69 ± 0.24 48.06 ± 0.30 58.84 ± 0.28 193.03 ± 7.61 2.71 ± 0.12
KG2E pub 71.5 59.
R 0.58 ± 0.07 0.11 ± 0.04 0.36 ± 0.08 0.56 ± 0.10 1.01 ± 0.14 5779.07 ± 51.02 78.40 ± 0.68
RotatE pub 79.7 74.6 83. 88.4 40.
R 55.00 ± 0.06 41.53 ± 0.06 64.14 ± 0.07 71.23 ± 0.05 78.67 ± 0.08 42.28 ± 0.13 0.63 ± 0.00
SimplE pub 72.7 66. 77.3 83.8
R 0.04 ± 0.00 0.01 ± 0.00 0.03 ± 0.00 0.04 ± 0.00 0.06 ± 0.01 7395.75 ± 2.02 100.02 ± 0.03
O 23.90 ± 8.79 11.58 ± 6.42 24.16 ± 10.95 34.73 ± 13.40 54.28 ± 15.80 139.34 ± 49.45
P 0.03 ± 0.00 0.01 ± 0.00 0.03 ± 0.00 0.04 ± 0.00 0.06 ± 0.01 14652.16 ± 45.71
TransD pub 77.3 91.
R 33.99 ± 0.03 21.22 ± 0.03 40.48 ± 0.10 48.57 ± 0.09 58.71 ± 0.14 153.37 ± 5.35 2.29 ± 0.09
TransE pub 47.1 125.
R 26.01 ± 0.17 15.23 ± 0.16 29.85 ± 0.24 37.18 ± 0.24 47.34 ± 0.18 127.92 ± 0.86 1.78 ± 0.01
TransH pub 64.4 87.
R 2.54 ± 0.20 1.69 ± 0.25 2.95 ± 0.20 3.29 ± 0.22 3.74 ± 0.18 6320.02 ± 30.37 85.63 ± 0.40
O 2.54 ± 0.20 1.69 ± 0.25 2.95 ± 0.20 3.29 ± 0.22 3.74 ± 0.18 6320.00 ± 30.37
P 2.54 ± 0.20 1.69 ± 0.25 2.95 ± 0.20 3.29 ± 0.22 3.74 ± 0.18 6320.05 ± 30.37
TransR pub 68.7 77.
R 0.65 ± 0.02 0.37 ± 0.00 0.63 ± 0.04 0.78 ± 0.06 1.03 ± 0.07 6795.95 ± 16.65 91.99 ± 0.22
O 0.65 ± 0.02 0.37 ± 0.00 0.63 ± 0.04 0.78 ± 0.06 1.03 ± 0.07 6795.94 ± 16.65
P 0.65 ± 0.02 0.37 ± 0.00 0.63 ± 0.04 0.78 ± 0.06 1.03 ± 0.07 6795.95 ± 16.65
TuckER pub 79.5 74.1 83.3 89.2
R 0.07 ± 0.02 0.01 ± 0.00 0.02 ± 0.00 0.03 ± 0.01 0.15 ± 0.17 7327.77 ± 29.22 99.11 ± 0.39

TABLE 12
Reproduction of Studies on FB15K-237 where pub refers to published results, R to results based on the realistic ranking, O to results based on
the optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the optimistic and pessimistic ranking in case
they differ from the realistic ranking.

MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ConvE pub 32.5 23.7 35.6 50.1 244.
R 26.93 ± 0.11 18.22 ± 0.11 29.51 ± 0.24 35.98 ± 0.16 44.95 ± 0.17 255.46 ± 6.16 3.73 ± 0.13
ConvKB pub 39.6 51.7 257.
R 4.71 ± 0.23 3.31 ± 0.23 4.04 ± 0.19 4.57 ± 0.22 7.76 ± 0.88 4345.27 ± 46.99 61.36 ± 0.65
RotatE pub 33.8 24.1 37.5 53.3 177.
R 26.42 ± 0.04 17.57 ± 0.06 28.97 ± 0.05 35.29 ± 0.09 44.55 ± 0.06 191.92 ± 0.31 2.84 ± 0.00
TABLE 13
Reproduction of Studies on WN18 where pub refers to published results, R to results based on the realistic ranking, O to results based on the
optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the optimistic and pessimistic ranking in case
they differ from the realistic ranking.

MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ComplEx pub 94.1 93.6 94.5 94.7
R 19.49 ± 2.55 12.36 ± 1.96 20.66 ± 2.75 25.24 ± 3.33 32.92 ± 4.40 452.67 ± 63.05 2.21 ± 0.31
ConvE pub 94.3 93.5 94.6 95.6 374.
R 88.81 ± 0.09 85.14 ± 0.10 91.76 ± 0.11 93.29 ± 0.04 94.85 ± 0.06 444.40 ± 14.82 2.17 ± 0.07
DistMult pub 83. 94.2
R 77.44 ± 0.22 67.45 ± 0.34 85.94 ± 0.21 89.52 ± 0.25 92.72 ± 0.18 458.64 ± 23.96 2.24 ± 0.12
HolE pub 93.8 93. 94.5 94.9
R 70.44 ± 0.45 59.29 ± 0.53 79.29 ± 0.47 84.12 ± 0.36 88.61 ± 0.42 812.64 ± 28.33 3.97 ± 0.14
O 70.44 ± 0.45 59.29 ± 0.53 79.29 ± 0.47 84.12 ± 0.36 88.61 ± 0.42 812.63 ± 28.33
P 70.44 ± 0.45 59.29 ± 0.53 79.29 ± 0.47 84.12 ± 0.36 88.61 ± 0.42 812.64 ± 28.34
KG2E pub 92.8 331.
R 3.61 ± 0.26 1.35 ± 0.22 3.21 ± 0.31 4.57 ± 0.34 7.02 ± 0.43 2708.89 ± 44.57 13.25 ± 0.22
O 3.61 ± 0.26 1.35 ± 0.22 3.21 ± 0.31 4.57 ± 0.34 7.02 ± 0.43 2708.88 ± 44.57
P 3.61 ± 0.26 1.35 ± 0.22 3.21 ± 0.31 4.57 ± 0.34 7.02 ± 0.43 2708.89 ± 44.57
RotatE pub 94.9 94.4 95.2 95.9 309.
R 87.29 ± 0.12 82.17 ± 0.20 91.53 ± 0.12 93.44 ± 0.07 95.28 ± 0.08 123.68 ± 1.71 0.61 ± 0.01
SimplE pub 94.2 93.9 94.4 94.7
R 0.04 ± 0.01 0.01 ± 0.01 0.03 ± 0.00 0.04 ± 0.01 0.07 ± 0.03 20376.43 ± 42.30 99.57 ± 0.21
O 38.48 ± 4.00 33.93 ± 4.32 39.59 ± 4.67 42.76 ± 3.73 47.01 ± 2.66 384.53 ± 66.45
P 0.03 ± 0.01 0.01 ± 0.01 0.03 ± 0.00 0.04 ± 0.01 0.07 ± 0.03 40368.33 ± 114.95
TransD pub 92.2 212.
R 36.22 ± 0.12 3.94 ± 0.27 65.63 ± 0.55 79.64 ± 0.43 87.27 ± 0.41 444.39 ± 25.61 2.17 ± 0.13
TransE pub 89.2 251.
R 39.19 ± 1.21 9.99 ± 1.82 64.74 ± 0.91 75.44 ± 0.48 84.25 ± 0.33 468.24 ± 13.64 2.29 ± 0.07
TransH pub 82.3 388.
R 0.18 ± 0.04 0.04 ± 0.02 0.19 ± 0.10 0.29 ± 0.14 0.39 ± 0.11 19678.04 ± 18.92 96.16 ± 0.09
O 0.18 ± 0.04 0.04 ± 0.02 0.19 ± 0.10 0.29 ± 0.14 0.39 ± 0.11 19678.02 ± 18.92
P 0.18 ± 0.04 0.04 ± 0.02 0.19 ± 0.10 0.29 ± 0.14 0.39 ± 0.11 19678.06 ± 18.92
TransR pub 92. 225.
R 0.06 ± 0.02 0.00 ± 0.00 0.04 ± 0.02 0.05 ± 0.03 0.11 ± 0.06 19686.49 ± 100.97 96.20 ± 0.49
O 0.06 ± 0.02 0.00 ± 0.00 0.04 ± 0.02 0.05 ± 0.03 0.11 ± 0.06 19686.49 ± 100.97
P 0.06 ± 0.02 0.00 ± 0.00 0.04 ± 0.02 0.05 ± 0.03 0.11 ± 0.06 19686.50 ± 100.97
TuckER pub 95.3 94.9 95.5 95.8
R 0.03 ± 0.01 0.00 ± 0.00 0.02 ± 0.01 0.03 ± 0.03 0.04 ± 0.03 20622.46 ± 153.52 100.78 ± 0.75

TABLE 14
Reproduction of Studies on WN18RR, where pub refers to published results, R to results based on the realistic ranking, O to results based on the
optimistic ranking, and P to results based on the pessimistic ranking. Results for the optimistic and pessimistic rankings are shown only
when they differ from the realistic ranking.

Model                 MRR (%)      Hits@1 (%)     Hits@3 (%)     Hits@5 (%)     Hits@10 (%)    MR             AMR (%)
ConvE pub 43. 4. 44. 52. 4187.
R 44.69 ± 0.21 40.98 ± 0.22 46.49 ± 0.14 48.92 ± 0.23 51.76 ± 0.13 5369.49 ± 50.92 26.49 ± 0.25
ConvKB pub 24.8 52.5 2554.
R 0.30 ± 0.07 0.09 ± 0.03 0.21 ± 0.07 0.32 ± 0.10 0.57 ± 0.16 13634.66 ± 714.24 67.27 ± 3.52
O 0.30 ± 0.07 0.09 ± 0.03 0.21 ± 0.07 0.32 ± 0.10 0.57 ± 0.16 13634.65 ± 714.24
P 0.30 ± 0.07 0.09 ± 0.03 0.21 ± 0.07 0.32 ± 0.10 0.57 ± 0.16 13634.66 ± 714.24
RotatE pub 47.6 42.8 49.2 57.1 3340.
R 48.40 ± 0.09 44.02 ± 0.15 50.55 ± 0.12 52.98 ± 0.11 56.51 ± 0.26 4263.32 ± 90.33 21.03 ± 0.45
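
The realistic, optimistic, and pessimistic rankings reported in Tables 12–14 differ only in how ties between the test triple and other candidates are broken: the optimistic rank resolves ties in favor of the test triple, the pessimistic rank against it, and the realistic rank is the mean of the two, which is why degenerate models that assign (near-)constant scores show large gaps between the O and P rows. The adjusted mean rank (AMR) normalizes the mean rank by its expected value under random scoring, so values near 100% indicate chance-level performance. The following minimal NumPy sketch illustrates the three rank definitions; the function name and interface are ours for illustration and are not part of PyKEEN's API.

```python
import numpy as np


def ranks(true_score: float, candidate_scores: np.ndarray) -> dict:
    """Rank of the test triple among all candidates (higher score = better).

    ``candidate_scores`` is assumed to include the score of the test triple itself.
    """
    optimistic = 1 + int((candidate_scores > true_score).sum())   # ties resolved in favor of the test triple
    pessimistic = int((candidate_scores >= true_score).sum())     # ties resolved against the test triple
    realistic = 0.5 * (optimistic + pessimistic)                  # expected rank under random tie-breaking
    return {"optimistic": optimistic, "realistic": realistic, "pessimistic": pessimistic}


# A degenerate model that scores every candidate identically obtains an
# optimistic rank of 1 but a pessimistic rank equal to the candidate set size.
print(ranks(true_score=0.0, candidate_scores=np.zeros(1000)))
```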

TABLE 15
Model sizes in bytes for the best reported configurations studied in the reproducibility study.

Model        FB15K        FB15K-237    WN18         WN18RR
ComplEx 26.1 MB - 49.2 MB -
ConvE 22.5 MB 20.3 MB 41.2 MB 40.9 MB
ConvKB - 5.9 MB - 8.2 MB
DistMult 6.5 MB - 16.4 MB -
HolE 9.8 MB - 24.6 MB -
KG2E 6.5 MB - 16.4 MB -
RotatE 130.4 MB 117.9 MB 163.8 MB 162.3 MB
SimplE 26.1 MB - 65.5 MB -
TransD 6.5 MB - 16.4 MB -
TransE 3.3 MB - 3.3 MB -
TransH 7.1 MB - 8.2 MB -
TransR 16.7 MB - 8.4 MB -
TuckER 46.1 MB - 37.6 MB -
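
The sizes in Table 15 are dominated by the entity and relation embedding matrices. As a rough plausibility check (not the exact accounting used to produce the table), the storage of a TransE-style model with one real-valued vector per entity and per relation can be estimated as below; the dataset statistics and the 50-dimensional embedding in the example are assumptions chosen for illustration.

```python
def approx_model_bytes(num_entities: int, num_relations: int,
                       embedding_dim: int, bytes_per_param: int = 4) -> int:
    """Approximate storage of a model holding one float32 vector per entity and per relation."""
    return (num_entities + num_relations) * embedding_dim * bytes_per_param


# FB15K has roughly 15k entities and 1.3k relations; with a 50-dimensional embedding,
# (14,951 + 1,345) * 50 * 4 bytes is about 3.3 MB, consistent with the TransE entry in Table 15.
print(approx_model_bytes(num_entities=14_951, num_relations=1_345, embedding_dim=50) / 1e6, "MB")
```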

ADDITIONAL RESULTS FROM BENCHMARKING STUDY

TABLE 16
Pareto-optimal models for FB15K-237 regarding Model Bytes and Hits@10

Model Loss Training Approach Inverse Relations Model Bytes Hits@10 (%)
RotatE NSSAL sLCWA no 28.8 MiB 48.060
RotatE NSSAL sLCWA yes 14.6 MiB 47.742
DistMult CEL LCWA yes 3.7 MiB 46.032
TransE SPL LCWA no 3.6 MiB 43.722
UM MRL sLCWA yes 3.5 MiB 3.166
UM MRL sLCWA no 3.5 MiB 3.075

TABLE 17
Pareto-optimal models for Kinships regarding Model Bytes and Hits@10

Model Loss Training Approach Inverse Relations Model Bytes Hits@10 (%)
RotatE MRL sLCWA no 258.0 KiB 98.743
RotatE NSSAL sLCWA yes 77.0 KiB 98.417
ProjE NSSAL sLCWA yes 39.3 KiB 95.438
ProjE SPL sLCWA no 33.0 KiB 94.088
HolE SPL LCWA no 32.2 KiB 87.384
UM SPL LCWA no 26.0 KiB 10.335
UM MRL sLCWA no 26.0 KiB 6.844

TABLE 18
Pareto-optimal models for WN18RR regarding Model Bytes and Hits@10

Model Loss Training Approach Inverse Relations Model Bytes Hits@10 (%)
RotatE SPL LCWA yes 79.3 MiB 60.089
RotatE BCEL LCWA yes 19.8 MiB 59.337
ConvE SPL LCWA yes 10.6 MiB 53.129
ConvKB SPL sLCWA yes 5.0 MiB 41.365
ConvKB NSSAL sLCWA no 2.5 MiB 20.332

TABLE 19
Pareto-optimal models for YAGO3-10 regarding Training Time and Hits@10

Model Loss Training Approach Inverse Relations Training Time Hits@10 (%)
RotatE NSSAL sLCWA yes 1.96 h 62.595
ComplEx SPL sLCWA no 53.89 min 62.224
ComplEx BCEL sLCWA no 37.45 min 59.916
TransE MRL sLCWA no 12.26 min 40.837
TransE SPL sLCWA no 11.03 min 16.128
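
Tables 16–19 list, per dataset, the configurations that are Pareto-optimal with respect to the two reported criteria, i.e., no other configuration is at most as costly (in model bytes or training time) and strictly better in Hits@10. The helper below is a minimal sketch of such a selection, not the script used to generate the tables; its name and input format are assumptions for illustration.

```python
from typing import List, Tuple


def pareto_front(configs: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Keep (name, cost, quality) entries not dominated by a cheaper-or-equal, strictly better entry."""
    front, best_quality = [], float("-inf")
    # Sweep in order of increasing cost; among equal costs, consider the better quality first.
    for name, cost, quality in sorted(configs, key=lambda c: (c[1], -c[2])):
        if quality > best_quality:
            front.append((name, cost, quality))
            best_quality = quality
    return front


# Example using the FB15K-237 entries of Table 16 (cost = model size in MiB, quality = Hits@10 in %).
print(pareto_front([
    ("RotatE / NSSAL / sLCWA", 28.8, 48.060),
    ("RotatE / NSSAL / sLCWA / inv", 14.6, 47.742),
    ("DistMult / CEL / LCWA / inv", 3.7, 46.032),
    ("TransE / SPL / LCWA", 3.6, 43.722),
]))
```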

TABLE 20
Best configuration for each model in FB15K-237

Model Loss Training Approach Inverse Relations Hits@10


ComplEx CEL LCWA True 0.429470
DistMult CEL LCWA True 0.460319
ERMLP BCEL LCWA True 0.430962
HolE CEL LCWA True 0.406571
KG2E SPL LCWA True 0.431647
NTN SPL sLCWA False 0.195983
ProjE CEL LCWA True 0.401458
RESCAL CEL LCWA True 0.445249
RotatE NSSAL sLCWA False 0.480600
SimplE CEL LCWA True 0.386731
TransD MRL sLCWA True 0.394755
TransE CEL LCWA True 0.442827
TransH MRL sLCWA False 0.337998
TransR CEL LCWA True 0.384064
TuckER SPL LCWA True 0.450435
UM CEL LCWA False 0.079974

TABLE 21
Best configuration for each model in Kinships

Model Loss Training Approach Inverse Relations Hits@10


ComplEx BCEL sLCWA True 0.979050
ConvE NSSAL sLCWA True 0.982775
ConvKB NSSAL sLCWA True 0.969739
DistMult CEL LCWA True 0.922719
ERMLP NSSAL sLCWA True 0.968808
HolE CEL LCWA True 0.927840
KG2E MRL sLCWA True 0.910615
NTN BCEL sLCWA True 0.921788
ProjE NSSAL sLCWA True 0.954376
RESCAL SPL sLCWA False 0.975791
RotatE MRL sLCWA False 0.987430
SE NSSAL sLCWA True 0.980912
SimplE BCEL sLCWA True 0.981378
TransD CEL LCWA True 0.455773
TransE CEL LCWA True 0.912942
TransH CEL LCWA True 0.523743
TransR SPL sLCWA False 0.715549
TuckER BCEL LCWA True 0.985102
UM BCEL sLCWA False 0.109404

TABLE 22
Best configuration for each model in WN18RR

Model Loss Training Approach Inverse Relations Hits@10


ComplEx CEL LCWA False 0.534884
ConvE CEL LCWA True 0.561047
ConvKB NSSAL sLCWA True 0.468707
DistMult CEL LCWA True 0.529583
ERMLP BCEL sLCWA True 0.475205
HolE CEL LCWA False 0.507866
KG2E SPL LCWA False 0.507182
NTN MRL sLCWA False 0.313440
ProjE CEL LCWA True 0.517442
RESCAL CEL LCWA True 0.531977
RotatE SPL LCWA True 0.600889
SE SPL sLCWA False 0.453830
SimplE CEL LCWA True 0.509234
TransD MRL sLCWA False 0.463064
TransE SPL LCWA False 0.569767
TransH MRL sLCWA False 0.478112
TransR MRL sLCWA False 0.425274
TuckER CEL LCWA True 0.542579
UM SPL LCWA False 0.450410

TABLE 23
Best configuration for each model in YAGO3-10

Model Loss Training Approach Inverse Relations Hits@10


ComplEx BCEL sLCWA True 0.625151
ConvKB SPL sLCWA True 0.578081
DistMult MRL sLCWA True 0.549077
ERMLP SPL sLCWA True 0.597852
HolE BCEL sLCWA False 0.593135
RESCAL SPL sLCWA True 0.550080
RotatE NSSAL sLCWA True 0.625953
SE NSSAL sLCWA True 0.290044
TransD MRL sLCWA False 0.358691
TransE MRL sLCWA True 0.485448
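
Tables 20–23 report, for each interaction model, the combination of loss, training approach, and inverse-relation modeling with the highest Hits@10. Given a table of all benchmarking results, this is a per-model argmax; the pandas sketch below assumes a hypothetical results DataFrame with columns model, loss, training_approach, inverse_relations, and hits_at_10, which is not the exact format used in the benchmarking repository.

```python
import pandas as pd

# Illustrative excerpt of benchmarking results (column names and rows are assumptions).
results = pd.DataFrame(
    [
        ("RotatE", "NSSAL", "sLCWA", False, 0.4806),
        ("RotatE", "MRL", "sLCWA", False, 0.4221),
        ("DistMult", "CEL", "LCWA", True, 0.4603),
        ("DistMult", "MRL", "sLCWA", True, 0.3698),
    ],
    columns=["model", "loss", "training_approach", "inverse_relations", "hits_at_10"],
)

# For each interaction model, keep the configuration with the highest Hits@10.
best = results.loc[results.groupby("model")["hits_at_10"].idxmax()].sort_values("model")
print(best)
```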

[Figure 26: four scatter panels (FB15k-237, Kinships, WN18RR, YAGO3-10) plotting Hits@10 against model size in bytes on a logarithmic x-axis; point colors distinguish the interaction models (ComplEx, ConvE, ConvKB, DistMult, ERMLP, HolE, ProjE, RotatE, TransE, UM).]

Fig. 26. Scatter plots comparing model size (in bytes) with model performance in terms of Hits@10 for all trained models on each dataset. The color indicates the model type, and the model size is shown on a logarithmic axis. Pareto-optimal models are highlighted by cross symbols. In general, we observe only a weak correlation between model size and performance. A more thorough comparison can be found in Figures 4, 10, 16, and 22.

[Figure 27: per-configuration Hits@10 values (in %), with interaction models as rows and loss functions as columns (LCWA panels: BCEL, CEL, SPL; sLCWA panels: BCEL, MRL, NSSAL, SPL), split by training approach and by the use of inverse relations.]
Fig. 27. Results for all configurations on Kinships based on Adadelta. BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss, MRL to the margin ranking loss, NSSAL to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed world assumption training approach, and sLCWA to the stochastic local closed world assumption training approach.

[Figure 28: per-configuration Hits@10 values (in %), with interaction models as rows and loss functions as columns (LCWA panels: BCEL, CEL, SPL; sLCWA panels: BCEL, MRL, NSSAL, SPL), split by training approach and by the use of inverse relations.]
Fig. 28. Results for all configurations on Kinships based on Adam. BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss, MRL to the margin ranking loss, NSSAL to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed world assumption training approach, and sLCWA to the stochastic local closed world assumption training approach.

[Figure 29: per-configuration Hits@10 values (in %), with interaction models as rows and loss functions as columns (LCWA panels: BCEL, CEL, SPL; sLCWA panels: BCEL, MRL, NSSAL, SPL), split by training approach and by the use of inverse relations.]
Fig. 29. Results for all configurations on WN18RR based on Adam. BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss, MRL to the margin ranking loss, NSSAL to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed world assumption training approach, and sLCWA to the stochastic local closed world assumption training approach.

[Figure 30: per-configuration Hits@10 values (in %), with interaction models as rows and loss functions as columns (LCWA panels: BCEL, CEL, SPL; sLCWA panels: BCEL, MRL, NSSAL, SPL), split by training approach and by the use of inverse relations.]
Fig. 30. Results for all configurations on FB15K-237 based on Adam. BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss, MRL to the margin ranking loss, NSSAL to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed world assumption training approach, and sLCWA to the stochastic local closed world assumption training approach.

[Figure 31: per-configuration Hits@10 values (in %) under the sLCWA training approach, with interaction models as rows and loss functions (BCEL, MRL, NSSAL, SPL) as columns, split by the use of inverse relations.]
Fig. 31. Results for all configurations on YAGO3-10 based on Adam. BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss, MRL to the margin ranking loss, NSSAL to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed world assumption training approach, and sLCWA to the stochastic local closed world assumption training approach.

[Figure 32: one panel per interaction model; within each panel, the y-axis lists the loss functions (BCEL, CEL, MRL, NSSAL, SPL), the x-axis shows hits@10, and results are given separately for the LCWA and sLCWA training approaches.]

Fig. 32. Impact of the training approach on performance for a fixed interaction model and loss function on the Kinships dataset (results are aggregated by their medians). BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss, MRL to the margin ranking loss, NSSAL to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed world assumption training approach, and sLCWA to the stochastic local closed world assumption training approach.

[Figure 33: one panel per interaction model; within each panel, the y-axis lists the loss functions (BCEL, CEL, MRL, NSSAL, SPL), the x-axis shows hits@10, and results are given separately for the LCWA and sLCWA training approaches.]

Fig. 33. Impact of the training approach on performance for a fixed interaction model and loss function on the WN18RR dataset (results are aggregated by their medians). BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss, MRL to the margin ranking loss, NSSAL to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed world assumption training approach, and sLCWA to the stochastic local closed world assumption training approach.

[Figure 34: one panel per interaction model; within each panel, the y-axis lists the loss functions (BCEL, CEL, MRL, NSSAL, SPL), the x-axis shows hits@10, and results are given separately for the LCWA and sLCWA training approaches.]

Fig. 34. Impact of the training approach on performance for a fixed interaction model and loss function on the FB15K-237 dataset (results are aggregated by their medians). BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss, MRL to the margin ranking loss, NSSAL to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed world assumption training approach, and sLCWA to the stochastic local closed world assumption training approach.
