Bringing Light Into The Dark: A Large-Scale Evaluation of Knowledge Graph Embedding Models Under A Unified Framework
Abstract—The heterogeneity in recently published knowledge graph embedding models' implementations, training, and evaluation has made fair and thorough comparisons difficult. To assess the reproducibility of previously published results, we re-implemented and evaluated 19 models in the PyKEEN software package. In this paper, we outline which results could be reproduced with their reported hyper-parameters, which could only be reproduced with alternate hyper-parameters, and which could not be reproduced at all, as well as provide insight as to why this might be the case.

We then performed a large-scale benchmarking on four datasets with several thousands of experiments and 21,246 GPU hours of computation time. We present insights gained as to best practices, best configurations for each model, and where improvements could be made over previously published best configurations. Our results highlight that the combination of model architecture, training approach, loss function, and the explicit modeling of inverse relations is crucial for a model's performance and is not only determined by its architecture. We provide evidence that several architectures can obtain results competitive to the state of the art when configured carefully. We have made all code, experimental configurations, results, and analyses available at https://github.com/pykeen/pykeen and https://github.com/pykeen/benchmarking.

Index Terms—Knowledge Graph Embeddings, Link Prediction, Reproducibility, Benchmarking

1 INTRODUCTION

e.g., question answering, dialogue systems, and recommendation systems [1]. Knowledge graph embedding models (KGEMs) present an avenue for predicting missing links. However, the following two major challenges remain in their application.

First, the reproduction of previously reported results turned out to be a major challenge: there are even examples of different results reported for the same combinations of KGEMs and datasets [2]. In some cases, the lack of availability of source code for KGEMs or the usage of different frameworks and programming languages inevitably introduces variability. In other cases, the lack of a precise specification of hyper-parameters introduces variability.

Second, the verification of the novelty of previously reported results remains difficult. It is often difficult to attribute the incremental improvements in performance reported with each new state-of-the-art model to the model's architecture itself, or instead to the training approach, hyper-parameter values, or specific preprocessing steps, e.g., the explicit modeling of inverse relations. It has been shown that baseline models can achieve competitive performance to more sophisticated ones when optimized appropriately [3], [2]. Additionally, the variety of implementations and interpretations of common evaluation metrics for link prediction makes a fair comparison to previous results difficult [4].
3.1.0.7 RESCAL: RESCAL [20] is a bilinear model that models entities as vectors and relations as matrices. The relation matrices W_r ∈ R^{d×d} contain weights w_{i,j} that capture the amount of interaction between the i-th latent factor of h ∈ R^d and the j-th latent factor of t ∈ R^d [11], [20]. Thus, the plausibility score of (h, r, t) ∈ K is given by:

f(h, r, t) = hᵀ W_r t = Σ_{i=1}^{d} Σ_{j=1}^{d} w_{ij}^{(r)} h_i t_j    (16)

3.1.0.8 DistMult: DistMult [5] is a simplification of RESCAL where the relation matrices W_r ∈ R^{d×d} are restricted to diagonal matrices:

f(h, r, t) = hᵀ W_r t = Σ_{i=1}^{d} h_i · diag(W_r)_i · t_i    (17)

Because of its restriction to diagonal matrices, DistMult is computationally more efficient than RESCAL, but at the same time less expressive. For instance, it is not able to model anti-symmetric relations, since f(h, r, t) = f(t, r, h).

3.1.0.9 ComplEx: ComplEx [21] is an extension of DistMult that uses complex-valued representations for the entities and relations. Entities and relations are represented as vectors h, r, t ∈ C^d, and the plausibility score is computed using the Hadamard product:

f(h, r, t) = Re(⟨h, r, t̄⟩)
           = ⟨Re(h), Re(r), Re(t)⟩ + ⟨Im(h), Re(r), Im(t)⟩ + ⟨Re(h), Im(r), Im(t)⟩ − ⟨Im(h), Im(r), Re(t)⟩    (18)

where Re(x) and Im(x) denote the real and imaginary parts of the complex-valued vector x, and t̄ denotes the complex conjugate of t. Because the score is not symmetric in h and t (the tail embedding enters through its conjugate), ComplEx can model anti-symmetric relations in contrast to DistMult.

3.1.0.10 RotatE: RotatE [22] models relations as rotations from head to tail entities in the complex space:

t = h ⊙ r    (19)

where h, r, t ∈ C^d and |r_i| = 1, that is, the complex elements of r are restricted to have a modulus of one. Because of the latter, r_i can be represented as e^{iθ_{r,i}}, which corresponds to a counterclockwise rotation by θ_{r,i} radians. The interaction model is then defined as:

f(h, r, t) = −‖h ⊙ r − t‖    (20)

which allows to model symmetry, antisymmetry, inversion, and composition [22].

3.1.0.11 SimplE: SimplE [23] is an extension of canonical polyadic (CP) decomposition [23], one of the early tensor factorization approaches. In CP, each entity e ∈ E is represented by two vectors h_e, t_e ∈ R^d and each relation by a single vector r ∈ R^d. Depending on whether an entity participates in a triple as the head or tail entity, either h_e or t_e is used. Both entity representations are learned independently, i.e. observing a triple (e_1, r, e_2), the method only updates h_{e_1} and t_{e_2}. In contrast to CP, SimplE introduces for each relation r the inverse relation r′, and formulates the interaction model based on both:

f(h, r, t) = ½ (⟨h_{e_i}, r, t_{e_j}⟩ + ⟨h_{e_j}, r′, t_{e_i}⟩)    (21)

Therefore, for each triple (e_1, r, e_2) ∈ K, both h_{e_1} and t_{e_2} as well as h_{e_2} and t_{e_1} are updated [23].

3.1.0.12 TuckER: TuckER [24] is a linear model that is based on the tensor factorization method Tucker [25] in which a three-mode tensor X ∈ R^{I×J×K} is decomposed into a set of factor matrices A ∈ R^{I×P}, B ∈ R^{J×Q}, and C ∈ R^{K×R} and a core tensor Z ∈ R^{P×Q×R} (of lower rank):

X ≈ Z ×₁ A ×₂ B ×₃ C    (22)

where ×_n is the tensor product, with n denoting along which mode the tensor product is computed. In TuckER, a KG is considered as a binary tensor which is factorized using the Tucker factorization where E = A = C ∈ R^{n_e×d_e} denotes the entity embedding matrix, R = B ∈ R^{n_r×d_r} represents the relation embedding matrix, and W = Z ∈ R^{d_e×d_r×d_e} is the core tensor that indicates the extent of interaction between the different factors. The interaction model is defined as:

f(h, r, t) = W ×₁ h ×₂ r ×₃ t    (23)

where h, t correspond to rows of E and r to a row of R.

3.1.0.13 ProjE: ProjE [26] is a neural network-based approach with a combination and a projection layer. The interaction model first combines h and r by the following combination operator [26]:

h ⊗ r = D_e h + D_r r + b_c    (24)

where D_e, D_r ∈ R^{k×k} are diagonal matrices which are used as shared parameters among all entities and relations, and b_c ∈ R^k represents the candidate bias vector shared across all entities. Next, the score for the triple (h, r, t) ∈ K is computed:

f(h, r, t) = g(tᵀ z(h ⊗ r) + b_p)    (25)

where g and z are activation functions, and b_p represents the shared projection bias.

3.1.0.14 HolE: Holographic embeddings (HolE) [27] make use of the circular correlation operator to compute interactions between latent features of entities and relations:

f(h, r, t) = σ(rᵀ (h ⋆ t))    (26)

where the circular correlation ⋆ : R^d × R^d → R^d is defined as:

[a ⋆ b]_i = Σ_{k=0}^{d−1} a_k · b_{(i+k) mod d}    (27)

By using the correlation operator, each component [h ⋆ t]_i represents a sum over a fixed partition of pairwise interactions. This enables the model to put semantically similar interactions into the same partition and share weights through r. Similarly, irrelevant interactions of features could also be placed into the same partition, which could be assigned a small weight in r.
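As an illustration of Eqs. (26)–(27), the following sketch (our own NumPy example, not code from PyKEEN) computes the circular correlation both directly from its definition and via the fast Fourier transform, which is the standard way to evaluate it efficiently, and uses it for the HolE score:

```python
import numpy as np

def circular_correlation_naive(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """[a * b]_i = sum_k a_k * b_{(i+k) mod d}, directly following Eq. (27)."""
    d = a.shape[0]
    return np.array([np.sum(a * np.roll(b, -i)) for i in range(d)])

def circular_correlation_fft(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Equivalent computation in the Fourier domain: ifft(conj(fft(a)) * fft(b))."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

def hole_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """HolE plausibility score sigma(r^T (h * t)) as in Eq. (26)."""
    corr = circular_correlation_fft(h, t)
    return float(1.0 / (1.0 + np.exp(-(r @ corr))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, r, t = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
    # The naive and the FFT-based variants agree.
    assert np.allclose(circular_correlation_naive(h, t), circular_correlation_fft(h, t))
    print(hole_score(h, r, t))
```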
3.1.0.15 KG2E: KG2E [28] aims to explicitly model (un)certainties in entities and relations (e.g. influenced by the number of triples observed for these entities and relations). Therefore, entities and relations are represented by probability distributions, in particular by multi-variate Gaussian distributions N_i(μ_i, Σ_i), where the mean μ_i ∈ R^d denotes the position in the vector space and the diagonal

3.1.0.16 ER-MLP:

f(h, r, t) = wᵀ g(W[h; r; t])    (30)

where W ∈ R^{k×3d} represents the weight matrix of the hidden layer, w ∈ R^k the weights of the output layer, and g denotes an activation function such as the hyperbolic tangent.

3.1.0.17 Neural Tensor Network: The Neural Tensor Network (NTN) [30] uses a bilinear tensor layer instead of a standard linear neural network layer:

f(h, r, t) = u_rᵀ · tanh(h W_r t + V_r [h; t] + b_r)    (31)

where W_r ∈ R^{d×d×k} is the relation-specific tensor, and the weight matrix V_r ∈ R^{k×2d}, the bias vector b_r, and the weight vector u_r ∈ R^k are the standard parameters of a neural network, which are also relation-specific. The result of the tensor product h W_r t is a vector x ∈ R^k where each entry x_i is computed based on the slice i of the tensor W_r: x_i = h W_r^i t [30]. As indicated by the interaction model, NTN defines for each relation a separate neural network, which makes the model very expressive but at the same time computationally expensive.

3.1.0.18 ConvKB: ConvKB [31] uses a convolutional neural network (CNN) whose feature maps capture global interactions of the input. Each triple (h, r, t) ∈ K is represented as an input matrix A = [h; r; t] ∈ R^{d×3} in which the columns represent the embeddings for h, r and t. In the convolution layer, a set of convolutional filters ω_i ∈ R^{1×3}, i = 1, ..., τ, are applied on the input in order to compute for each dimension global interactions of the embedded triple, yielding feature maps v_i ∈ R^d with

[v_i]_j = g(ω_i A_{j,·}ᵀ + b)    (32)

where b ∈ R denotes a bias term and g an activation function which is employed element-wise. Based on the resulting feature maps v_1, ..., v_τ, the plausibility score of a triple is given by:

f(h, r, t) = [v_1; ...; v_τ] · w    (33)

where [v_1; ...; v_τ] ∈ R^{τd×1} and w ∈ R^{τd×1} is a shared weight vector. ConvKB may be seen as a restriction of ER-MLP with a certain weight sharing pattern in the first layer.

3.1.0.19 ConvE: ConvE [32] is a CNN-based approach. For each triple (h, r, t), the input to ConvE is a matrix A ∈ R^{2×d} where the first row of A represents h ∈ R^d and the second row represents r ∈ R^d. A is reshaped to a matrix B ∈ R^{m×n} where the first m/2 rows represent h and the remaining m/2 rows represent r. In the convolution layer, a set of 2-dimensional convolutional filters Ω = {ω_i | ω_i ∈ R^{r×c}} is applied on B to capture interactions between h and r. The resulting feature maps are reshaped and concatenated in order to create a feature vector v ∈ R^{|Ω|rc}. In the next step, v is mapped into the entity space using a linear transformation W ∈ R^{|Ω|rc×d}, that is e_{h,r} = vᵀ W. The score for the triple (h, r, t) ∈ K is then given by:

f(h, r, t) = e_{h,r} t    (34)

Since the interaction model can be decomposed into f(h, r, t) = ⟨f′(h, r), t⟩, the model is particularly designed for 1-N scoring, i.e. the efficient computation of scores for (h, r, t) for fixed h, r and many different t.
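To make the 1-N scoring idea concrete, the following sketch (our own PyTorch illustration with randomly initialized embeddings, not taken from any model's reference implementation) scores a single (h, r)-pair against all candidate tail entities at once for a DistMult-style interaction, which factorizes as f(h, r, t) = ⟨h ⊙ r, t⟩:

```python
import torch

def distmult_1_to_n_scores(entity_emb: torch.Tensor,
                           relation_emb: torch.Tensor,
                           h_idx: int, r_idx: int) -> torch.Tensor:
    """Score one (h, r)-pair against every entity as a candidate tail.

    Because DistMult factorizes as f(h, r, t) = <h * r, t>, the scores for all
    candidate tails are obtained with a single matrix-vector product.
    """
    query = entity_emb[h_idx] * relation_emb[r_idx]  # f'(h, r), a d-dimensional vector
    return entity_emb @ query                        # one score per entity

if __name__ == "__main__":
    torch.manual_seed(0)
    num_entities, num_relations, dim = 100, 10, 16
    E = torch.randn(num_entities, dim)
    R = torch.randn(num_relations, dim)
    scores = distmult_1_to_n_scores(E, R, h_idx=3, r_idx=7)
    print(scores.shape)  # torch.Size([100])
```

Because all candidate scores are produced by one matrix product, multi-class losses such as the binary cross entropy over the full label vector or the cross entropy loss (Section 3.3) can be applied directly to the resulting score vector.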
Fig. 3. Visualization of different training approaches for the relation works_at in the KG in Figure 1. Red color indicates positive examples, i.e. true triples present in the KG. Dark blue color denotes triples used as negative examples in LCWA. Light blue color denotes sampling candidates for negative examples in sLCWA. Yellow color indicates triples that are not considered.

3.2 Training Approaches

Because most KGs are generated under the open world assumption (OWA), we require training approaches involving techniques such as negative sampling to avoid over-generalization to true facts. Here, we describe two common training approaches found in the literature: the local closed world assumption (LCWA) and the stochastic local closed world assumption (sLCWA). It should be noted that the LCWA and the sLCWA do not affect the evaluation.

3.2.1 Local closed world assumption

The LCWA was introduced by [29] and used in subsequent works as an approach to generate negative examples during training [32], [24]. In this setting, for any triple (h, r, t) ∈ K that has been observed, a set T⁻(h, r) of negative examples is created by considering all triples (h, r, t_i) ∉ K as false. Therefore, for our exemplary KG (Figure 1), for the pair (Peter, works at), the triple (Peter, works at, DHL) is a false fact since for this pair only the triple (Peter, works at, Deutsche Bank) is part of the KG. Similarly, we can construct H⁻(r, t) based on all triples (h_i, r, t) ∉ K, or R⁻(h, t) based on the triples (h, r_i, t) ∉ K. Constructing R⁻(h, t) is a popular choice in the visual relation detection domain [33], [34]. However, most of the works in knowledge graph modeling construct only T⁻(h, r) as the set of negative examples, and in the context of this work we refer to T⁻(h, r) as the set of negative examples when speaking about LCWA.

3.2.2 Stochastic local closed world assumption

Under the stochastic local closed world assumption (sLCWA), instead of considering all possible triples (h, r, t_i) ∉ K, (h_i, r, t) ∉ K or (h, r_i, t) ∉ K as false, we randomly take samples of these sets. Two common approaches for generating negative samples are uniform negative sampling (UNS) [16] and Bernoulli negative sampling (BNS) [17], in which negative triples are created by corrupting a positive triple (h, r, t) ∈ K by replacing either h or t. We denote with N the set of all potential negative triples:

N = ⋃_{(h,r,t)∈K} N(h, r, t)    (35)
N(h, r, t) = T(h, r) ∪ H(r, t)    (36)
T(h, r) = {(h, r, t′) | t′ ∈ E ∧ t′ ≠ t}    (37)
H(r, t) = {(h′, r, t) | h′ ∈ E ∧ h′ ≠ h}    (38)

Theoretically, we would need to exclude all positive triples from this set of candidates for negative triples, i.e., N⁻ = N \ K. In practice, however, since usually |N| ≫ |K|, the likelihood of generating a false negative is rather low. Therefore, the additional filter step is often omitted to lower computational cost. It should be taken into account that a corrupted triple that is not part of the KG can represent a true fact.

UNS and BNS differ in the way they define sample weights for (h′, r, t) or (h, r, t′):

3.2.2.1 Uniform negative sampling: With uniform negative sampling (UNS) [16], the first step is to randomly (uniformly) determine whether h or t shall be corrupted for a positive triple (h, r, t) ∈ K. Afterwards, an entity e ∈ E is uniformly sampled and selected as the corrupted head/tail entity.

3.2.2.2 Bernoulli negative sampling: With Bernoulli negative sampling (BNS) [17], the probability of corrupting h or t in (h, r, t) ∈ K is determined by the property of the relation r: if the relation is a one-to-many relation (e.g. motherOf), BNS assigns a higher probability to replacing h, and if it is a many-to-one relation (e.g. bornIn) it assigns a higher probability to replacing t. More precisely, for each relation r ∈ R, the average number of tails per head (tph) and heads per tail (hpt) are first computed. These statistics are then used to define a Bernoulli distribution with parameter tph/(tph + hpt). For a triple (h, r, t) ∈ K, the head is corrupted with probability tph/(tph + hpt) and the tail with probability hpt/(tph + hpt). The described approach reduces the chance of creating corrupted triples that represent true facts [17].
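As a minimal sketch of the two samplers (our own illustration; the toy triples and function names are not part of PyKEEN), the following code derives tph and hpt per relation and corrupts a positive triple either uniformly (UNS) or with the Bernoulli probabilities (BNS):

```python
import random
from collections import defaultdict

def relation_statistics(triples):
    """Compute tails-per-head (tph) and heads-per-tail (hpt) for every relation."""
    heads, tails, counts = defaultdict(set), defaultdict(set), defaultdict(int)
    for h, r, t in triples:
        heads[r].add(h)
        tails[r].add(t)
        counts[r] += 1
    return {r: (counts[r] / len(heads[r]), counts[r] / len(tails[r])) for r in counts}

def corrupt(triple, entities, stats=None):
    """Replace the head or tail of a positive triple.

    With stats=None the side is chosen uniformly (UNS); otherwise the head is
    replaced with probability tph / (tph + hpt) as in Bernoulli sampling (BNS).
    Note that the result may accidentally be a true triple (false negative);
    the filter step that would remove such cases is often omitted (see above).
    """
    h, r, t = triple
    if stats is None:
        p_head = 0.5
    else:
        tph, hpt = stats[r]
        p_head = tph / (tph + hpt)
    e = random.choice(entities)
    return (e, r, t) if random.random() < p_head else (h, r, e)

if __name__ == "__main__":
    random.seed(0)
    triples = [("Peter", "works_at", "Deutsche_Bank"),
               ("Sarah", "works_at", "Deutsche_Bank"),
               ("Anna", "works_at", "DHL"),
               ("Sarah", "studied_at", "University_of_Oxford")]
    entities = sorted({e for h, _, t in triples for e in (h, t)})
    stats = relation_statistics(triples)
    print(corrupt(triples[0], entities))         # uniform negative sampling
    print(corrupt(triples[0], entities, stats))  # Bernoulli negative sampling
```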
3.3 Loss Functions

The loss function can have a significant influence on the performance of KGEMs [7]. In the following, we describe pointwise, pairwise, and setwise loss functions that have frequently been used within KGEMs. For additional discussion and a slightly different categorization, we refer to the work of Mohamed et al. [7].

3.3.1 Pointwise Loss Functions

Let f denote the interaction model of a KGEM. With t_i, we denote a triple (i.e. t_i ∈ K), and with l_i ∈ {0, 1} or l̂_i ∈ {−1, 1} its corresponding label, where 1 corresponds to the label of the positive triples, and 0 / −1 to the label of the negative triples. Pointwise loss functions compute an independent loss term for each triple-label pair, i.e. for a batch B = {(t_i, l_i)}_{i=1}^{|B|}, the loss is given as

L = (1 / |B|) Σ_{(t_i, l_i) ∈ B} L(t_i, l_i)    (39)

In the following, we describe four different pointwise losses: the square error loss, the binary cross entropy loss (BCEL), the pointwise hinge loss, and the logistic loss.

3.3.1.1 Square Error Loss: The square error loss function computes the squared difference between the predicted scores and the labels l_i ∈ {0, 1} [7]:

L(t_i, l_i) = ½ (f(t_i) − l_i)²    (40)

The squared error loss strongly penalizes predictions that deviate considerably from the labels, and is usually used for regression problems. For simple models it often permits more efficient optimization algorithms involving analytical solutions of sub-problems, e.g. the Alternating Least Squares algorithm used by [20].

3.3.1.2 Binary cross entropy loss: The binary cross entropy loss is defined as [32]:

L(t_i, l_i) = −(l_i · log(σ(f(t_i))) + (1 − l_i) · log(1 − σ(f(t_i))))    (41)

where l_i ∈ {0, 1} and σ represents the logistic sigmoid function. Thus, the problem is framed as a binary classification problem of triples, where the model's outputs are regarded as logits. The loss is not well-suited for translational distance models because these models produce a negative distance as score and cannot produce positive model outputs. ConvE and TuckER were originally trained in a multi-class setting using the binary cross entropy loss, where each (h, r)-pair has been classified against all e ∈ E simultaneously, i.e., if |E| = n, the label vector for each (h, r)-pair has n entries indicating whether the triple (h, r, e_i) is (not) part of the KG, and along each dimension of the label vector a binary classification is performed. It should be noted that there exist different implementation variants of the binary cross entropy loss that address numerical stability. ConvE and TuckER employed a numerically unstable variant, and in the context of this work, we refer to this variant when referring to the binary cross entropy loss.

3.3.1.3 Pointwise Logistic Loss/Softplus Loss: An alternative, but equivalent formulation of the binary cross entropy loss is the pointwise logistic loss (or Softplus loss (SPL)):

L(t_i, l̂_i) = log(1 + exp(−l̂_i · f(t_i)))    (42)

where l̂_i ∈ {−1, 1} [7]. It has been used to train ComplEx, ConvKB, and SimplE. We consider both variants separately because both have been used in different model implementations, and their implementation details might yield different results (e.g., with respect to numerical stability).

3.3.1.4 Pointwise Hinge Loss: The pointwise hinge loss sets the score of positive examples larger than a margin parameter λ while reducing the scores of negative examples to values below −λ:

L(t_i, l̂_i) = max(0, λ − l̂_i · f(t_i))    (43)

where l̂_i ∈ {−1, 1}. The loss penalizes scores of positive examples which are smaller than λ, but does not impose any restriction on values > λ. Similarly, negative scores larger than −λ contribute to the loss, whereas all values smaller than −λ do not have any loss contribution [7]. Thereby, the model is not encouraged to further optimize triples which are already predicted well enough (according to the margin parameter λ).

3.3.2 Pairwise Loss Functions

Next, we describe widely applied pairwise loss functions that are used within KGEMs, namely the pairwise hinge loss and the pairwise logistic loss. They both compare the scores of a positive triple t⁺ and a negative triple t⁻. The negative triple in a pair is usually obtained by corrupting the positive one. Thus, the pairs often share common head or tail entities and relations. For a batch of pairs B = {(t_i⁺, t_i⁻)}_{i=1}^{|B|}, the loss is given as

L = (1 / |B|) Σ_{(t_i⁺, t_i⁻) ∈ B} L(f(t_i⁻) − f(t_i⁺))    (44)

Hence, the loss function evaluates the difference in scores Δ = f(t_i⁻) − f(t_i⁺) between a positive and a negative triple, rather than their absolute scores. This is in accordance with the OWA, where we do not assume to have negative labels, but just "less positive" ones.

3.3.2.1 Pairwise Hinge Loss/Margin Ranking Loss: The pairwise hinge loss or margin ranking loss (MRL) is given by

L(Δ) = max(0, λ + Δ)    (45)

3.3.2.2 Pairwise Logistic Loss: The pairwise logistic loss is defined as [7]:

L(Δ) = log(1 + exp(Δ))    (46)

Thus, it can be seen as a soft-margin formulation of the pairwise hinge loss with a margin of zero.

3.3.3 Setwise Loss Functions

Setwise loss functions compare neither individual scores nor pairs of them, but rather more than two triples' scores. Here, we describe the self-adversarial negative sampling loss (NSSAL) and the cross entropy loss (CEL) as examples of such loss functions that have been applied within KGEMs [22], [7].

3.3.3.1 Self-adversarial negative sampling loss: Instead of treating each negative sample equally in the loss function, the self-adversarial negative sampling loss (NSSAL) weights negative samples according to the scores predicted by the interaction model [22]. Let (h, r, t) ∈ K denote a true triple, and {(h′_i, r, t′_i)}_{i=1}^{K} the set of negative samples generated for it. Then, we define the following distribution over the negative samples:

p((h′_i, r, t′_i)) = exp(α f(h′_i, r, t′_i)) / Σ_{j=1}^{K} exp(α f(h′_j, r, t′_j))    (47)

where α ∈ R denotes a temperature parameter. The probabilities of the negative triples are then used as weights for them to compute the loss:

L = −log(σ(γ + f(h, r, t))) − Σ_{i=1}^{K} p((h′_i, r, t′_i)) · log(σ(−(γ + f(h′_i, r, t′_i))))    (48)

Thus, negative samples for which the model predicts a high score relative to other samples are weighted stronger.
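The following sketch (our own PyTorch illustration, not the reference implementation of [22]) spells out Eqs. (47)–(48) for a single positive triple: the negative scores define a softmax distribution with temperature α, which is treated as constant and used to weight the per-negative logistic terms.

```python
import torch
import torch.nn.functional as F

def nssal_loss(pos_score: torch.Tensor,
               neg_scores: torch.Tensor,
               margin: float = 9.0,
               alpha: float = 1.0) -> torch.Tensor:
    """Self-adversarial negative sampling loss for one positive triple.

    pos_score:  scalar tensor with f(h, r, t) of the true triple.
    neg_scores: tensor of shape (K,) with the scores of the generated negatives.
    """
    # Eq. (47): softmax over the negative scores, scaled by the temperature alpha;
    # detach() keeps the sample weights out of the gradient computation.
    weights = F.softmax(alpha * neg_scores, dim=0).detach()
    # Eq. (48): weighted logistic terms for the positive and the negative triples.
    pos_term = -F.logsigmoid(margin + pos_score)
    neg_term = -(weights * F.logsigmoid(-(margin + neg_scores))).sum()
    return pos_term + neg_term

if __name__ == "__main__":
    torch.manual_seed(0)
    positive = torch.tensor(-2.0)
    negatives = torch.randn(16) - 5.0
    print(nssal_loss(positive, negatives).item())
```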
3.3.3.2 Cross entropy loss: The cross entropy loss (CEL) has been successfully applied together with 1-N scoring, i.e., predicting for each (h, r)-pair simultaneously a score for each possible tail entity, and framing the problem as a multi-class classification problem [3], [8]. To apply the CEL, first, the labels are normalized in order to form a proper probability distribution. Second, the predicted scores for the tail entities of an (h, r)-pair are normalized by a softmax:

p(t | h, r) = exp(f(h, r, t)) / Σ_{t′ ∈ E} exp(f(h, r, t′))    (49)

Finally, the cross entropy between the distribution of the normalized scores and the normalized label distribution is computed:

L = −Σ_{t′ ∈ E} I[(h, r, t′) ∈ K] · log(p(t′ | h, r))    (50)

where I denotes the indicator function. Note that this loss differs from the multi-class binary cross entropy as it applies a softmax normalization, implying that this is a single-label multi-class problem.

3.4 Explicitly Modeling Inverse Relations

Inverse relations, introduced by [23] and [35], are explicitly modeled by extending the set of relations R by a set of inverse relations r_inv ∈ R_inv with R_inv ∩ R = ∅. This is achieved by training an inverse triple (t, r_inv, h) for each triple (h, r, t) ∈ K. Equipping a KGEM with inverse relations implicitly doubles the relation embedding space of any model that has relation embeddings. The goal is to alter the scoring function, such that the task of predicting the head entities for (r, t) pairs becomes the task of predicting tail entities for (t, r_inv) pairs. The explicit training of the implicitly known inverse relations can lead to better model performance [35] and can for some models increase the computational efficiency [32].

4 EVALUATION METRICS FOR KGEMS

KGEMs are usually evaluated based on link prediction, which is defined on a KG as predicting the tail/head entities for (h, r)/(r, t) pairs. For instance, given queries of the form (Sarah, studied at, ?) or (?, CEO of, Deutsche Bank), the capability of a link predictor to predict the correct entities that answer the query, i.e. (Sarah, studied at, University of Oxford) and (Sarah, CEO of, Deutsche Bank), is measured.

However, given the fact that usually true negative examples are not available, both the training and the test set contain only true facts. For this reason, the evaluation procedure is defined as a ranking task in which the capability of the model to differentiate corrupted triples from known true triples is assessed [16]. For each test triple t⁺ = (h, r, t) ∈ K_test, two sets of corrupted triples are constructed:

1) H(r, t) = {(h′, r, t) | h′ ∈ E − {h}}, which contains all the triples where the head entity has been corrupted, and
2) T(h, r) = {(h, r, t′) | t′ ∈ E − {t}}, which contains all the triples with corrupted tail entity.

For each t⁺ and its corresponding corrupted triples, the scores are computed and the entities sorted accordingly. Next, the rank of every t⁺ among its corrupted triples is determined, i.e. the position in the score-sorted list.

Among the corrupted triples in H(r, t) / T(h, r), there might be true triples that are part of the KG. If these false negatives are ranked higher than the current test triple t⁺, the results might get distorted. Therefore, the filtered evaluation setting has been proposed [16], in which the corrupted triples are filtered to exclude known true facts from the train and test set. Thus, the rank does not decrease when ranking another true entity higher.

Moreover, we want to draw attention to the fact that the metrics can be further distorted by unknown false negatives, i.e., true triples that are contained in the set of corrupted triples but are not part of the KG (and therefore cannot be filtered out). Therefore, it is essential to investigate the predicted scores of a KGEM and not solely rely on the computed metrics.

Based upon these individual ranks, the following measures are frequently used to summarize the overall performance:

4.0.0.1 Mean rank: The mean rank (MR) represents the average rank of the test triples, i.e.

MR = (1 / |K_test|) Σ_{t ∈ K_test} rank(t)    (51)

Smaller values indicate better performance.

4.0.0.2 Adjusted mean rank: Because the interpretation of the MR depends on the number of available candidate triples, comparing MRs across different datasets (or with the inclusion of inverse triples) is difficult. This is sometimes further exacerbated in the filtered setting because the number of candidates varies; with fewer candidates available, it becomes easier to achieve low ranks. The adjusted mean rank (AMR) [10] compensates for this problem by comparing the mean rank against the expected mean rank under a model with random scores:

AMR = MR / ( (1 / |K_test|) Σ_{t ∈ K_test} (ξ(t) + 1) / 2 )    (52)

where ξ(t) denotes the number of candidate triples against which the true triple t ∈ K_test is ranked. In the unfiltered setting we have ξ(t) = |E| − 1 for all t ∈ K_test. Thereby, the measure also adjusts for chance, as a random scoring achieves an expected adjusted mean rank of 1. The AMR has a fixed value range from 0 to 1, where smaller values (AMR ≪ 1) indicate better performance.

4.0.0.3 Mean reciprocal rank: The mean reciprocal rank (MRR) is defined as:

MRR = (1 / |K_test|) Σ_{t ∈ K_test} 1 / rank(t)    (53)

where K_test is the set of test triples, i.e. the MRR is the mean over the reciprocal individual ranks. However, the MRR is flawed since the reciprocal rank is an ordinal scale and not an interval scale, i.e. computing the arithmetic mean is statistically incorrect [36], [37]. Still, it is often used for early stopping since it is a smooth measure with a stronger weight on small ranks, and is less affected by outlier individual ranks than the mean rank. The MRR has a fixed value range from 0 to 1, where larger values indicate better performance.
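As a small worked example (our own sketch; the ranks and candidate counts are purely illustrative), the rank-based metrics defined above can be computed directly from the individual filtered ranks:

```python
import numpy as np

def rank_metrics(ranks: np.ndarray, num_candidates: np.ndarray) -> dict:
    """Summarize the ranks of the test triples.

    ranks:          1-based rank of each true test triple among its corrupted candidates.
    num_candidates: number of candidate triples xi(t) each true triple was ranked against.
    """
    mr = ranks.mean()                                  # Eq. (51): mean rank
    mrr = (1.0 / ranks).mean()                         # Eq. (53): mean reciprocal rank
    expected_mr = ((num_candidates + 1) / 2.0).mean()  # expected mean rank of random scoring
    amr = mr / expected_mr                             # Eq. (52): adjusted mean rank
    return {"MR": mr, "MRR": mrr, "AMR": amr}

if __name__ == "__main__":
    # Five test triples, each ranked against all |E| - 1 = 99 candidates (unfiltered setting).
    ranks = np.array([1, 3, 2, 10, 50])
    num_candidates = np.full_like(ranks, 99)
    print(rank_metrics(ranks, num_candidates))
```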
4.0.0.4 Hits@K: Hits@K denotes the ratio of the test triples that have been ranked among the top k triples, i.e.,

Hits@k = |{t ∈ K_test | rank(t) ≤ k}| / |K_test|    (54)

Larger values indicate better performance.

4.0.0.5 Additional Metrics: Further metrics that might be relevant are the area under the Receiver Operating Characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR) [11]. However, these metrics require the number of true positives, false positives, true negatives, and false negatives, which in most cases cannot be computed since the KGs are usually incomplete.

5 EXISTING BENCHMARK DATASETS

In this section, we describe the benchmark datasets that have been established to evaluate KGEMs. A summary is also given in Table 2.

TABLE 2
Existing Benchmark Datasets.

Dataset    | Triples   | Entities | Relations
FB15K      | 592,213   | 14,951   | 1,345
FB15K-237  | 272,115   | 14,541   | 237
WN18       | 151,442   | 40,943   | 18
WN18RR     | 93,003    | 40,943   | 11
Kinships   | 10,686    | 104      | 26
Nations    | 11,191    | 14       | 56
UMLS       | 893,025   | 135      | 49
YAGO3-10   | 1,079,040 | 132,182  | 37

5.0.0.1 FB15K: Freebase is a large cross-domain KG consisting of around 1.2 billion triples and more than 80 million entities. Bordes et al. [16] extracted a subset of Freebase, which is used as a benchmark dataset and named it FB15K. It contains 14,951 entities, 1,345 relations, as well as more than half a million triples describing facts about movies, actors, awards, sports, and sports teams [32].

5.0.0.2 FB15K-237: FB15K has a test leakage, i.e. a major part of the test triples (∼81%) are inverses of triples contained in the training set: for most of the test triples of the form (h, r, t), there exists a triple (h, r′, t) or (t, r′, h) in the training set. Therefore, Toutanova and Chen [38] constructed FB15K-237 in which inverse relations were removed [38]. FB15K-237 contains 14,541 entities and 237 relations.

5.0.0.3 WN18: WordNet¹ is a lexical knowledge base in which entities represent terms and are called synsets. Relations in WordNet represent conceptual-semantic and lexical relationships (e.g. hyponym). Bordes et al. [14] extracted a subset of WordNet named WN18 that is frequently used to evaluate KGEMs. It contains 40,943 synsets and 18 relations.

1. https://wordnet.princeton.edu/

5.0.0.4 WN18RR: Similarly to FB15K, WN18 also has a test leakage (of approximately 94%) [38]. For instance, for most of the test triples of the form (h, hyponym, t), there exists a triple (t, hypernym, h) in the training set. Dettmers et al. [32] have shown that a simple rule-based system can obtain results competitive to the state-of-the-art results on WN18. For this reason, they constructed WN18RR by removing inverse relations similarly to the procedure applied to FB15K. WN18RR contains 40,943 entities and 11 relations.

5.0.0.5 Kinships: The Kinships [39] dataset describes relationships between members of the Australian tribe Alyawarra and consists of 10,686 triples. It contains 104 entities representing members of the tribe and 26 relationship types that represent kinship terms such as Adiadya or Umbaidya [14].

5.0.0.6 Nations: The Nations [40] dataset contains data about countries and their relationships with other countries. Exemplary relations are economic aid and accusation [14].

5.0.0.7 Unified Medical Language System [41]: The Unified Medical Language System (UMLS) [41] is an ontology that describes relationships between high-level concepts in the biomedical domain. Examples of contained concepts are Cell, Tissue, and Disease, and exemplary relations are part of and exhibits [14], [41].

5.0.0.8 YAGO3-10: Yet Another Great Ontology (YAGO) [42] is a KG containing facts that have been extracted from Wikipedia and aligned with WordNet in order to exploit the large amount of information contained in Wikipedia and the taxonomic information included in WordNet. It contains general facts about public figures, geographical entities, movies, and further entities, and it has a taxonomy for those concepts. YAGO3-10 is a subset of YAGO3 [43] (which is an extension of YAGO) that contains entities associated with at least ten different relations. In total, YAGO3-10 has 123,182 entities and 37 relations, and most of the triples describe attributes of persons such as citizenship, gender, and profession [32].

6 REPRODUCIBILITY STUDIES

The goal of the reproducibility studies was to investigate whether it is possible to replicate experiments based on the information provided in each model's accompanying paper. If specific information was missing, such as the number of training epochs, we tried to find this information in the accompanying source code if it was accessible. For our study, we focused on the two most frequently used benchmark datasets, FB15K and WN18, as well as their respective subsets FB15K-237 and WN18RR. Table 3 illustrates for which models results were reported (in the accompanying publications) for the considered datasets. A checkmark denotes that results were reported, and a green background indicates that the entire experimental setup for the corresponding dataset was described. Results have not been reported for every model for every dataset because some of the benchmark datasets were created after the models were published. Therefore, these models have been excluded from our reproducibility study.

6.0.0.1 Experimental Setup: For each KGEM, we applied identical training and evaluation settings as described in their concomitant papers. We ran each experiment four times with random seeds to measure the variance in
same score for many triples, there will be a large discrepancy between the three ranking approaches. We could observe such a discrepancy for SimplE, for which the results on FB15K (Table 11) and WN18 (Table 13) were almost 0% based on the average ranking approach, but were much higher based on the optimistic ranking approach. Similar observations for other KGEMs have been made in [4].

7 BENCHMARKING

In our benchmarking studies, we evaluated a large set of different combinations of interaction models, training approaches, loss functions, and the effect of explicitly modeling inverse relations. In particular, we investigated 19 interaction models, two training approaches, and five loss functions on four datasets. We refer to a specific combination of interaction model, training approach, loss function, and whether inverse relations are explicitly modeled as a configuration, e.g., RotatE + LCWA + SPL + inverse relations. We do not refer to different hyper-parameter values such as batch size or learning rate when we use the term configuration. For each configuration, we used random search to perform the hyper-parameter optimization over all other hyper-parameters and applied early stopping on the validation set. Each hyper-parameter optimization experiment lasted for a maximum of 24 hours or 100 iterations, in which new hyper-parameters have been sampled in each iteration. Overall, we performed individual hyper-parameter optimizations for more than 1,000 configurations. We retrained the model with the best hyper-parameter setting and report evaluation results on the test set.

Before presenting our results, we provide an overview of the experimental setup, comprising the investigated interaction models, training approaches, loss functions, negative samplers, and datasets. We used the sLCWA and LCWA as training approaches. For the sLCWA, we applied 1:k scoring as usually done throughout the literature [16], [21], where k denotes the number of negative examples for each positive. For the LCWA, we applied 1:N scoring, i.e., we score each batch against all negative examples, as typically done for training with the LCWA [32]. Table 4 shows the hyper-parameter ranges for the sLCWA and the LCWA settings.

TABLE 4
Hyper-Parameter Ranges for Ablation Experiments

       | Hyper-Parameter             | Range
Shared | Embedding Dimension         | {64, 128, 256}
       | Initialization              | {Xavier}
       | Optimizers^a                | {Adam, Adadelta}
       | Learning Rate (log scale)   | [0.001, 0.1)
       | Batch Size^b                | {128, 256, 512}
       | Model inverse relations     | {Yes, No}
       | Epochs                      | 1,000
sLCWA  | Loss                        | {BCEL, MRL, NSSAL, SPL}
       | Margin for MRL              | {0.5, 1.5, ..., 9.5}
       | Margin for NSSAL            | {1, 3, 5, ..., 29}
       | ADVT for NSSAL              | {0.1, 0.2, ..., 1.0}
       | Number of Negatives^c       | {1, 2, ..., 100}
LCWA   | Loss                        | {BCEL, CEL, SPL}
       | Label Smoothing (log scale) | [0.001, 1.0)

^a For Kinships, we evaluated Adam and Adadelta, and for the remaining datasets we stuck with Adam since it performed in almost every experiment at least as well as Adadelta and in many experiments significantly better.
^b For YAGO3-10, the batch size has been sampled from the set {1024, 2048, 4096, 8192}.
^c For YAGO3-10, the number of negative triples per positive has been sampled from the set {1, 2, ..., 50}.

7.0.0.1 Datasets: We performed experiments on the following four datasets: WN18RR, FB15K-237, Kinships, and YAGO3-10. We selected WN18RR and FB15K-237 since they are widely applied benchmarking datasets. We chose Kinships and YAGO3-10 to investigate the performance of KGEMs on a small and a larger dataset.

7.0.0.2 Interaction Models: We investigated all interaction models described in Section 3.1. Because of our vast experimental setup and the size of YAGO3-10, we restricted the number of interaction models on YAGO3-10 as otherwise the computational effort would be prohibitive. Based on the variety of model types described in Section 3.1, we selected the following ten interaction models: ComplEx, ConvKB, DistMult, ERMLP, HolE, RESCAL, RotatE, SE, TransD, and TransE.

7.0.0.3 Training Approaches: We trained the interaction models based on the sLCWA (Section 3.2.2) and the LCWA (Section 3.2.1) training approaches. Due to the extent of our benchmarking study and the fact that YAGO3-10 contains more than 132,000 entities, which makes training based on the LCWA with 1-N scoring expensive, we restricted the training approach to the sLCWA for YAGO3-10.

7.0.0.4 Loss Functions: We investigated MRL, BCEL, SPL, NSSAL, and CEL since they represent the variety of types described in Section 3.3 and because they have previously been shown to yield good results. MRL has historically not been used in the 1-N scoring setting, likely because in 1-N scoring, the number of positive and negative scores in each batch is not known in advance and is dynamic. Thus, the number of possible pairs varies as well, ranging from N − 1 to (N/2)² for each (h, r) combination. The accompanying variance in memory requirements for each batch thus poses practical challenges. Therefore, we did not use the MRL in combination with the 1-N scoring setting.

7.0.0.5 Negative Sampler: When using the sLCWA, we generated negative samples with UNS. When training with the LCWA and 1-N scoring, no explicit negative sampling was required.

7.0.0.6 Early Stopping: We evaluated each model every 50 epochs and performed early stopping with a patience of 100 epochs on all datasets except for YAGO3-10. There, considering the larger number of triples seen in each epoch, we evaluated each model every 10 epochs and performed early stopping with a patience of 50 epochs.
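To illustrate what a single configuration corresponds to in practice, the following sketch uses PyKEEN's pipeline interface (a minimal example assuming a recent PyKEEN release; the exact argument names, resolvable component names, and result accessors may differ between versions and are not taken from our experimental scripts):

```python
from pykeen.pipeline import pipeline

# One configuration in the sense of this section: interaction model + training
# approach + loss function (+ optionally, explicitly modeled inverse relations).
result = pipeline(
    dataset="fb15k237",
    model="RotatE",
    loss="NSSALoss",                  # self-adversarial negative sampling loss
    training_loop="sLCWA",            # stochastic local closed world assumption
    negative_sampler="basic",         # uniform corruption of heads or tails
    model_kwargs=dict(embedding_dim=256),
    training_kwargs=dict(num_epochs=1000, batch_size=256),
    stopper="early",                  # early stopping on the validation set
    stopper_kwargs=dict(frequency=50, patience=2),
    random_seed=0,
)
print(result.metric_results.to_flat_dict())
```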
Below, we describe the results of our benchmarking study. In the four following subsections, we summarize the results for each dataset (i.e., Kinships, WN18RR, FB15K-237, YAGO3-10) along with a discussion of the effect of the models' individual components (i.e., training approaches, loss functions, the explicit modeling of inverse relations) and optimizers on the performance. Finally, we compare the model complexity versus performance. In the appendix, we provide further results. In particular, we provide for each model the results of all tested combinations of interaction model, training approach, and loss function.

7.1 Results on the Kinships Dataset

Investigating the model performances on Kinships is interesting because it is a comparatively small KG and thus permits for each configuration a large number of HPO iterations for all interaction models. Figure 4 provides a general overview of the results, i.e., the performance of the interaction models, loss functions, training approaches, the effect of modeling inverse relations, and the effect of the optimizers. Overall, several interaction models performed very well (hits@10 higher than 90%), some performed poorly (less than 80% hits@10), and the remaining interaction models revealed average performance (80%-90% hits@10; see Figure 4 as well as Figure 28 and Figure 27 that can be found in the appendix).

The key observations are that the interaction models TransD, TransH, TransR, and UM performed poorly (less than 80%). UM significantly under-performed, which is not surprising since it does not consider the multi-relational information of the data. For a dataset with structural properties similar to Kinships, the interaction models ComplEx and SimplE seem to be excellent choices since they do not have a very high model complexity, obtain good performances, and are robust throughout a large set of different configurations. In the following, we examine the effect of the single components that define a KGEM.

7.1.0.1 Impact of the Optimizer: On Kinships, we evaluated Adam and Adadelta as optimizers. Our results highlight that in most cases, Adam outperforms Adadelta (in many cases with a high margin) or at least performs equally well (Figure 4). One of the few interaction models that performed very well across all configurations that have been trained with Adadelta is ConvE (Figure 27 in the appendix). Since Adam performed better for the majority of the configurations and there was always at least one Adam-based configuration that performed equally well as the best Adadelta configuration, we decided to proceed only with Adam for the remaining datasets in order to reduce the computational costs.

7.1.0.2 Impact of the Training Approach: Figure 5 and Figure 6 depict the effect of the training approaches. The former summarizes the results over the interaction models, and the latter differentiates between them. At this point, we focus only on the BCEL and the SPL since they have been trained with both training approaches. The key observation is that both training approaches perform comparably well for the BCEL and SPL functions. However, BCEL reveals slightly less variance when trained in combination with the sLCWA, and the SPL varies less when trained based on the LCWA.

7.1.0.3 Impact of the Loss Function: Figure 5 highlights that all five loss functions achieve high performance, but at the same time, most of them exhibit high variance. An exception is the MRL, which is comparatively robust. Figure 32 (Appendix) reveals that some interaction models (e.g., ComplEx, ConvKB) obtain high results across all loss functions, and others such as ConvE and KG2E suffer for some of the loss functions.

7.1.0.4 Impact of Explicitly Modeling Inverse Relations: Figures 7-9 present the effect of explicitly modeling inverse relations. It can be observed that in general, the LCWA benefits from the explicit usage of inverse relations in terms of robustness, but sLCWA-based configurations can lead to similar results.

The loss functions that are trained under the LCWA are more robust when trained with inverse relations. Loss functions that have been trained based on the sLCWA exhibit, in all cases except for NSSAL, a lower variance when inverse relations are not explicitly modeled (Figure 8). This observation highlights that the modeling of inverse relations is closely related to the applied training approach.

Considering the interaction models (Figure 9), it can be seen that for some combinations of interaction models and loss functions, the impact of explicitly modeling inverse relations is significant. For instance, when trained based on CEL, ERMLP performs significantly better when trained with explicit inverse relations, whereas ConvE suffers from explicit inverse relations when trained with the MR loss. Here, we want to highlight the close connection to the training approaches. Therefore, it is important to consider these results in the context of the results depicted in Figures 5-7.

7.1.0.5 Model Complexity versus Performance: Figure 26 (Appendix) plots the model size against the obtained performance. The results highlight that there is no strong correlation between model size and performance, i.e., models with a small number of parameters can perform equally well as large models on the Kinships dataset. The skyline comprises small UM models, some intermediate HolE and ProjE models, and larger RotatE and TuckER models. A full list is provided in Table 17 (in the Appendix).

7.2 Results on the WN18RR Dataset

Figure 10 depicts the overall results over WN18RR. A detailed overview over all configurations can be found in Figure 29 in the appendix. The results highlight that there are several combinations of interaction models, loss functions, and training approaches that obtain hits@10 results that are competitive with state-of-the-art results³. In particular, ComplEx (53.49%), ConvE (56.10% compared to 52.00% in the original paper [32]), DistMult (52.96%), ProjE (51.74%), TransE (56.98%), RESCAL (53.64%), RotatE (60.09% compared to 56.61% in the original paper [22]), SimplE (50.92%), and TuckER (54.26% compared to 52.6% in the original paper [24]) obtained high performance. Especially the result obtained by TransE is impressive, since it beats most of the published state-of-the-art results. The results highlight that determining an appropriate configuration is fundamental since many interaction models such as ConvE and TransE reveal a high variance across different configurations.

Besides, the results show that with an appropriate configuration, the model complexity can be significantly reduced. For instance, for RotatE, several high-performing configurations have been found (Figure 29 in the Appendix),

3. https://paperswithcode.com/sota/link-prediction-on-wn18rr
Fig. 4. Overall hits@10 results for Kinships where box-plots summarize the best results across different configurations, i.e., combinations of interaction models, training approaches, loss functions, and the explicit usage of inverse relations.
Fig. 10. Overall hits@10 results for WN18RR where box-plots summarize the results across different combinations of interaction models, training approaches, loss functions, and the explicit usage of inverse relations.
Fig. 11. Impact of the training approach on the performance for a fixed loss function for the WN18RR dataset.
Fig. 12. Impact of the training approach on the performance for a fixed interaction model and loss function for the WN18RR dataset.

Fig. 14. Impact of explicitly modeling inverse relations on the performance for a fixed loss function for the WN18RR dataset.
7.3.0.1 Impact of the Training Approach: Figure 17 shows that for both BCEL and SPL (we focus here only on these two loss functions since they have been trained with both training approaches), the LCWA obtains significantly higher results, but they express a high variance at the same time. Figures 18 and 34 (Appendix) illustrate that some interaction models benefit from one of the training approaches, whereas others show similar performance with both. For instance, it can be observed that RotatE and TransE perform better in combination with the LCWA approach for both loss functions, whereas ERMLP achieves comparable results for both training approaches.

7.3.0.2 Impact of the Loss Function: Figure 16 illustrates that for all loss functions, comparable results can be achieved, but they exhibit higher variance when trained together with the LCWA. However, some interaction models seem to be more sensitive to the choice of the loss function. For instance, ComplEx and RESCAL performed better when trained based on the CEL (Figure 34), whereas the translational distance models TransH, TransD, and TransR benefit from the MRL.

7.3.0.3 Impact of Explicitly Modeling Inverse Relations: Figure 19 reveals, as for the previous datasets, that in general, the usage of inverse relations is beneficial for
Fig. 16. Overall hits@10 results for FB15K-237 where box-plots summarize the results across different combinations of interaction models, training approaches, loss functions, and the explicit usage of inverse relations.
Fig. 17. Impact of the training approach on the performance for a fixed loss function for the FB15K-237 dataset.
Fig. 22. Overall hits@10 results for YAGO3-10 where box-plots summarize the results across different combinations of interaction models, training approaches, loss functions, and the explicit usage of inverse relations. In contrast to the previous datasets, the models have only been trained based on the stochastic local closed world assumption.
Fig. 23. Impact of the loss functions on the performance for fixed interaction models for the YAGO3-10 dataset. Because this dataset was only trained under the sLCWA, the boxes are not split like for the other datasets.

Fig. 24. Impact of explicitly modeling inverse relations on the performance for a fixed loss function for the YAGO3-10 dataset.

Fig. 25. Impact of explicitly modeling inverse relations on the performance for a fixed interaction model and loss function for the YAGO3-10 dataset.

based on the LCWA training, the KGEM only learns to perform one-side predictions (i.e., it learns to either predict head or tail entities), but during the evaluation, it is asked to perform both-side predictions. Through the inclusion of inverse relations, the model learns to perform both-side predictions based on one side, i.e., (∗, r, t) can be predicted through (t, r_inverse, ∗). Overall, our results indicate that further investigations on FB15K-237 and YAGO3-10 might lead to results that are competitive to the state-of-the-art.

Looking forward, it would be of great interest to re-investigate previously performed studies that analyze the relationship between the performance of KGEMs and the properties of the underlying KGs to verify that their findings can indeed be attributed to the interaction model alone, rather than to the exact configuration including the loss function, the training approach, and the explicit modeling of inverse relations. Further, the effect of explicitly modeling inverse relations has not been analyzed in depth, in particular how the learned representations of a relation and its inverse are related to each other. Ultimately, we believe our work provides an empirical foundation for such studies and a practical tool to execute them.

ACKNOWLEDGMENT

We want to thank the Center for Information Services and High Performance Computing (ZIH) at TU Dresden for generous allocations of computer time and the Technical University of Denmark for providing us access to their DTU Compute GPU cluster that enabled us to conduct our studies. This work was funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A and Grant No. 01IS18050D (project "MLWin") as well as the Innovation Fund Denmark with the Danish Center for Big Data Analytics driven Innovation (DABAI).

REFERENCES

[1] Q. Wang, Z. Mao, B. Wang, and L. Guo, "Knowledge graph embedding: A survey of approaches and applications," IEEE Trans. Knowl. Data Eng., vol. 29, no. 12, pp. 2724–2743, 2017.
[2] F. Akrami, L. Guo, W. Hu, and C. Li, "Re-evaluating embedding-based knowledge graph completion methods," in CIKM. ACM, 2018, pp. 1779–1782.
[3] R. Kadlec, O. Bajgar, and J. Kleindienst, "Knowledge base completion: Baselines strike back," in Rep4NLP@ACL. Association for Computational Linguistics, 2017, pp. 69–74.
[4] Z. Sun, S. Vashishth, S. Sanyal, P. P. Talukdar, and Y. Yang, "A re-evaluation of knowledge graph completion methods," in ACL. Association for Computational Linguistics, 2020, pp. 5516–5522.
[5] B. Yang, W. Yih, X. He, J. Gao, and L. Deng, "Embedding entities and relations for learning and inference in knowledge bases," in ICLR (Poster), 2015.
[6] F. Akrami, M. S. Saeef, Q. Zhang, W. Hu, and C. Li, "Realistic re-evaluation of knowledge graph completion methods: An experimental study," in SIGMOD Conference. ACM, 2020, pp. 1995–2010.
[7] S. K. Mohamed, V. Novácek, P. Vandenbussche, and E. Muñoz, "Loss functions in knowledge graph embedding models," in DL4KG@ESWC, ser. CEUR Workshop Proceedings, vol. 2377. CEUR-WS.org, 2019, pp. 1–10.
[8] D. Ruffinelli, S. Broscheit, and R. Gemulla, "You CAN teach an old dog new tricks! On training knowledge graph embeddings," in ICLR. OpenReview.net, 2020.
[9] A. Rossi, D. Firmani, A. Matinata, P. Merialdo, and D. Barbosa, "Knowledge graph embedding for link prediction: A comparative analysis," CoRR, vol. abs/2002.00819, 2020.
[10] M. Berrendorf, E. Faerman, L. Vermue, and V. Tresp, "Interpretable and fair comparison of link prediction or entity alignment methods with adjusted mean rank," CoRR, vol. abs/2002.06914, 2020.
[11] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich, "A review of relational machine learning for knowledge graphs," Proc. IEEE, vol. 104, no. 1, pp. 11–33, 2016.
[12] B. Kotnis and V. Nastase, "Analysis of the impact of negative sampling on link prediction in knowledge graphs," CoRR, vol. abs/1708.06816, 2017.
[13] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek, "AMIE: Association rule mining under incomplete evidence in ontological knowledge bases," in Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 413–422.
[14] A. Bordes, X. Glorot, J. Weston, and Y. Bengio, "A semantic matching energy function for learning with multi-relational data - application to word-sense disambiguation," Mach. Learn., vol. 94, no. 2, pp. 233–259, 2014.
[15] A. Bordes, J. Weston, R. Collobert, and Y. Bengio, "Learning structured embeddings of knowledge bases," in AAAI. AAAI Press, 2011.
[16] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko, "Translating embeddings for modeling multi-relational data," in NIPS, 2013, pp. 2787–2795.
[17] Z. Wang, J. Zhang, J. Feng, and Z. Chen, "Knowledge graph embedding by translating on hyperplanes," in AAAI. AAAI Press, 2014, pp. 1112–1119.
[18] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu, "Learning entity and relation embeddings for knowledge graph completion," in AAAI. AAAI Press, 2015, pp. 2181–2187.
[19] G. Ji, S. He, L. Xu, K. Liu, and J. Zhao, "Knowledge graph embedding via dynamic mapping matrix," in ACL (1). The Association for Computer Linguistics, 2015, pp. 687–696.
[20] M. Nickel, V. Tresp, and H. Kriegel, "A three-way model for collective learning on multi-relational data," in ICML. Omnipress, 2011, pp. 809–816.
[21] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard, "Complex embeddings for simple link prediction," in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 48. JMLR.org, 2016, pp. 2071–2080.
[22] Z. Sun, Z. Deng, J. Nie, and J. Tang, "RotatE: Knowledge graph embedding by relational rotation in complex space," in ICLR (Poster). OpenReview.net, 2019.
[23] S. M. Kazemi and D. Poole, "SimplE embedding for link prediction in knowledge graphs," in NeurIPS, 2018, pp. 4289–4300.
[24] I. Balazevic, C. Allen, and T. M. Hospedales, "TuckER: Tensor factorization for knowledge graph completion," in EMNLP/IJCNLP (1). Association for Computational Linguistics, 2019, pp. 5184–5193.
[25] L. R. Tucker et al., "The extension of factor analysis to three-dimensional matrices," Contributions to Mathematical Psychology, vol. 110119, 1964.
[26] B. Shi and T. Weninger, "ProjE: Embedding projection for knowledge graph completion," in AAAI. AAAI Press, 2017, pp. 1236–1242.
[27] M. Nickel, L. Rosasco, and T. A. Poggio, "Holographic embeddings of knowledge graphs," in AAAI. AAAI Press, 2016, pp. 1955–1961.
[28] S. He, K. Liu, G. Ji, and J. Zhao, "Learning to represent knowledge graphs with Gaussian embedding," in CIKM. ACM, 2015, pp. 623–632.
[29] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, "Knowledge vault: A web-scale approach to probabilistic knowledge fusion," in KDD. ACM, 2014, pp. 601–610.
[30] R. Socher, D. Chen, C. D. Manning, and A. Y. Ng, "Reasoning with neural tensor networks for knowledge base completion," in NIPS, 2013, pp. 926–934.
[31] D. Q. Nguyen, T. D. Nguyen, D. Q. Nguyen, and D. Phung, "A novel embedding model for knowledge base completion based on convolutional neural network," arXiv preprint arXiv:1712.02121, 2017.
[32] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, "Convolutional 2D knowledge graph embeddings," in AAAI. AAAI Press, 2018, pp. 1811–1818.
[33] H. Zhang, Z. Kyaw, S. Chang, and T. Chua, "Visual translation embedding network for visual relation detection," in CVPR. IEEE Computer Society, 2017, pp. 3107–3115.
[34] S. Sharifzadeh, M. Berrendorf, and V. Tresp, "Improving visual relation detection using depth maps," CoRR, vol. abs/1905.00966, 2019.
[35] T. Lacroix, N. Usunier, and G. Obozinski, "Canonical tensor decomposition for knowledge base completion," in ICML, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 2018, pp. 2869–2878.
[36] N. Fuhr, "Some common mistakes in IR evaluation, and how they can be avoided," SIGIR Forum, vol. 51, no. 3, pp. 32–41, 2017.
[37] S. S. Stevens et al., "On the theory of scales of measurement," 1946.
[38] K. Toutanova and D. Chen, "Observed versus latent features for knowledge base and text inference," in Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, 2015, pp. 57–66.
[39] W. W. Denham, "The detection of patterns in Alyawarra nonverbal behavior," Ph.D. dissertation, University of Washington, Seattle, 1973.
[40] R. J. Rummel, The Dimensionality of Nations Project: Attributes of Nations and Behavior of Nations Dyads, 1950-1965. Inter-university Consortium for Political Research, 1976, no. 5409.
[41] A. T. McCray, "An upper-level ontology for the biomedical domain," International Journal of Genomics, vol. 4, no. 1, pp. 80–84, 2003.
[42] T. Rebele, F. M. Suchanek, J. Hoffart, J. Biega, E. Kuzey, and G. Weikum, "YAGO: A multilingual knowledge base from Wikipedia, WordNet, and GeoNames," in International Semantic Web
APPENDIX
TABLE 6
Summary of main insights over all datasets. A component (i.e., interaction model, loss function, or training approach) is considered to be
among the top-ten-performing configurations when it occurs at least once in the top-ten-performing configurations. Note that a single component
is part of several configurations and can therefore occur multiple times in the top-ten-performing configurations.
Interaction Models
RotatE Among top-ten-performing interaction models across all datasets.
TransE Among top-ten-performing interaction models on WN18RR and FB15K-237.
DistMult Among top-ten-performing interaction models on FB15K-237.
TuckER Among top-ten-performing interaction models on Kinships and FB15K-237 (has not been evaluated on YAGO3-10).
ComplEx Among top-ten-performing interaction models on YAGO3-10.
ConvE Among top-ten-performing interaction models on WN18RR (has not been evaluated on FB15K-237 and YAGO3-10).
HolE Among top-ten-performing interaction models on YAGO3-10.
ERMLP Among top-ten-performing interaction models on YAGO3-10.
RESCAL Among top-ten-performing interaction models on FB15K-237.
SimplE Among top-ten-performing interaction models on Kinships.
Loss Functions
BCEL Among top-ten-performing loss functions across all datasets.
NSSAL Among top-ten-performing loss functions across all datasets.
SPL Among top-ten-performing loss functions across all datasets.
CEL Among top-ten-performing loss functions on Kinships and FB15K-237 (has not been evaluated on YAGO3-10).
MRL Among top-ten-performing loss functions on Kinships and FB15K-237.
Training Approaches
sLCWA Among top-ten-performing training approaches across all datasets.
LCWA Among top-ten-performing training approaches on Kinships, WN18RR and FB15K-237 (has not been evaluated on YAGO3-10).
Explicit Modeling of Inverse Relations
Is usually beneficial in combination with the local closed world assumption.
Configurations
Performance The appropriate combination of interaction model, training approach, loss function, and the choice of explicitly modeling inverse
relations is crucial for performance; e.g., with an appropriate configuration, TransE can compete with several state-of-the-art
interaction models on WN18RR (a configuration sketch follows the table).
There is no single configuration that works best for all datasets.
Variance Some interaction models exhibit a high variance across different configurations, e.g., RotatE on YAGO3-10 (Figure 22).
Pareto-Optimal Tables 16-19 describe Pareto-optimal configurations. It can be seen that there are configurations that require fewer
Configurations parameters while obtaining almost the same performance. In some cases, for the same interaction model, the model can be
significantly compressed.
Reproducibility
Results For FB15K, two out of 12, for WN18, six out of 12, for FB15K-237, one out of three, and for WN18RR, two out of three
experiments can be categorized as soft-reproducible.
Code For four out of 13 models, no official implementation was available.
Parameters For four out of 13 papers, source code was available and the full experimental setup was precisely described.
General Insights
SOTA For WN18RR, our study yields state-of-the-art results in terms of Hits@10 based on a RotatE configuration (60.09% Hits@10, on
par with Graph Attenuated Attention Networks [44]). Furthermore, we found a TransE configuration that achieves high performance
and beats most published state-of-the-art results (56.98% Hits@10). Based on our results, we recommend further investigating the
hyper-parameter space of the most promising configurations for the remaining benchmarking datasets.
Improvements For ConvE (56.10% compared to 52.00% [32]) and TuckER (54.26% compared to 52.6% [24]), we beat the results reported in the
original papers on WN18RR by selecting appropriate configurations and hyper-parameters.
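
The interplay of these components can be made concrete with a minimal PyKEEN configuration sketch. The argument names below follow the public pykeen.pipeline API; exact spellings and defaults may differ between PyKEEN versions, so the listing should be read as an illustration rather than the exact setup used in our experiments.

    from pykeen.pipeline import pipeline

    # Minimal sketch: one configuration combining interaction model, loss,
    # training approach, and explicit modeling of inverse relations.
    result = pipeline(
        dataset="WN18RR",
        dataset_kwargs=dict(create_inverse_triples=True),  # explicit inverse relations
        model="TransE",
        model_kwargs=dict(embedding_dim=128),
        loss="NSSA",              # negative sampling self-adversarial loss (NSSAL)
        training_loop="sLCWA",    # stochastic local closed world assumption
        training_kwargs=dict(num_epochs=200, batch_size=256),
        random_seed=42,
    )
    print(result.metric_results.get_metric("hits@10"))

Note that the explicit modeling of inverse relations is toggled on the dataset side (one additional inverse triple per training triple), while the loss and training approach are selected independently of the interaction model.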
TABLE 7
Reproduction of Studies on FB15K without indicating the standard deviation, where pub refers to published results, R to results based on the
realistic ranking, O to results based on the optimistic ranking, and P to results based on the pessimistic ranking (the three ranking variants are
illustrated after the table). We only show the results of the optimistic and pessimistic ranking in case they differ from the realistic ranking.
Models highlighted in blue indicate that the experiments are soft-reproducible, whereas models highlighted in red indicate that the results
represent extreme outliers. The full result table with standard deviation across multiple runs can be found in Table 11.
MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ComplEx pub 69.2 59.9 75.9 84.
R 19.13 10.06 20.82 27.59 38.03 171.16 2.46
ConvE pub 65.7 55.8 72.3 83.1 51.
R 59.56 48.28 66.99 73.27 79.76 50.76 0.73
DistMult pub 35. 57.7
R 26.06 16.45 29.10 35.54 45.00 134.02 1.86
HolE pub 52.4 40.2 61.3 73.9
R 34.15 21.79 39.69 48.06 58.84 193.03 2.71
KG2E pub 71.5 59.
R 0.58 0.11 0.36 0.56 1.01 5779.07 78.40
RotatE pub 79.7 74.6 83. 88.4 40.
R 55.00 41.53 64.14 71.23 78.67 42.28 0.63
SimplE pub 72.7 66. 77.3 83.8
R 0.04 0.01 0.03 0.04 0.06 7395.75 100.02
O 23.90 11.58 24.16 34.73 54.28 139.34
P 0.03 0.01 0.03 0.04 0.06 14652.16
TransD pub 77.3 91.
R 33.99 21.22 40.48 48.57 58.71 153.37 2.29
TransE pub 47.1 125.
R 26.01 15.23 29.85 37.18 47.34 127.92 1.78
TransH pub 64.4 87.
R 2.54 1.69 2.95 3.29 3.74 6320.02 85.63
O 2.54 1.69 2.95 3.29 3.74 6320.00
P 2.54 1.69 2.95 3.29 3.74 6320.05
TransR pub 68.7 77.
R 0.65 0.37 0.63 0.78 1.03 6795.95 91.99
O 0.65 0.37 0.63 0.78 1.03 6795.94
P 0.65 0.37 0.63 0.78 1.03 6795.95
TuckER pub 79.5 74.1 83.3 89.2
R 0.07 0.01 0.02 0.03 0.15 7327.77 99.11
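
The three ranking variants reported in Tables 7–14 can be illustrated with a short NumPy sketch. This is a generic illustration rather than PyKEEN's actual implementation; the adjusted mean rank (AMR) is assumed here to follow Berrendorf et al. [10], i.e., the mean rank divided by its expected value under uniformly random scoring.

    import numpy as np

    def ranks(scores: np.ndarray, true_idx: int):
        """Optimistic, realistic, and pessimistic rank of the true entity.

        `scores` holds one score per candidate entity (higher = better).
        The three variants differ only in how ties with the true score are resolved.
        """
        true_score = scores[true_idx]
        better = np.sum(scores > true_score)
        equal = np.sum(scores == true_score)       # includes the true entity itself
        optimistic = better + 1                     # ties resolved in favor of the true entity
        pessimistic = better + equal                # ties resolved against the true entity
        realistic = 0.5 * (optimistic + pessimistic)  # expected rank under random tie-breaking
        return optimistic, realistic, pessimistic

    def aggregate(realistic_ranks: np.ndarray, num_candidates: int):
        """Hits@10, mean rank (MR), and adjusted mean rank (AMR, assumed definition)."""
        hits_at_10 = np.mean(realistic_ranks <= 10)
        mr = np.mean(realistic_ranks)
        expected_mr = (num_candidates + 1) / 2      # mean rank of a uniformly random ranking
        amr = mr / expected_mr
        return hits_at_10, mr, amr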
TABLE 8
Reproduction of Studies on FB15K-237 without indicating the standard deviation where pub refers to published results, R to results based on the
realistic ranking, O to results based on the optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the
optimistic and pessimistic ranking in case they differ from the realistic ranking. Models highlighted in blue indicate the experiments are
soft-reproducible whereas models highlighted in red indicate the results represent extreme outliers. The full result table with standard deviation
across multiple runs can be found at Table 12.
MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ConvE pub 32.5 23.7 35.6 50.1 244.
R 26.93 18.22 29.51 35.98 44.95 255.46 3.73
ConvKB pub 39.6 51.7 257.
R 4.71 3.31 4.04 4.57 7.76 4345.27 61.36
RotatE pub 33.8 24.1 37.5 53.3 177.
R 26.42 17.57 28.97 35.29 44.55 191.92 2.84
TABLE 9
Reproduction of Studies on WN18 without indicating the standard deviation where pub refers to published results, R to results based on the
realistic ranking, O to results based on the optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the
optimistic and pessimistic ranking in case they differ from the realistic ranking. Models highlighted in blue indicate the experiments are
soft-reproducible whereas models highlighted in red indicate the results represent extreme outliers. The full result table with standard deviation
across multiple runs can be found at Table 13.
MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ComplEx pub 94.1 93.6 94.5 94.7
R 19.49 12.36 20.66 25.24 32.92 452.67 2.21
ConvE pub 94.3 93.5 94.6 95.6 374.
R 88.81 85.14 91.76 93.29 94.85 444.40 2.17
DistMult pub 83. 94.2
R 77.44 67.45 85.94 89.52 92.72 458.64 2.24
HolE pub 93.8 93. 94.5 94.9
R 70.44 59.29 79.29 84.12 88.61 812.64 3.97
O 70.44 59.29 79.29 84.12 88.61 812.63
P 70.44 59.29 79.29 84.12 88.61 812.64
KG2E pub 92.8 331.
R 3.61 1.35 3.21 4.57 7.02 2708.89 13.25
O 3.61 1.35 3.21 4.57 7.02 2708.88
P 3.61 1.35 3.21 4.57 7.02 2708.89
RotatE pub 94.9 94.4 95.2 95.9 309.
R 87.29 82.17 91.53 93.44 95.28 123.68 0.61
SimplE pub 94.2 93.9 94.4 94.7
R 0.04 0.01 0.03 0.04 0.07 20376.43 99.57
O 38.48 33.93 39.59 42.76 47.01 384.53
P 0.03 0.01 0.03 0.04 0.07 40368.33
TransD pub 92.2 212.
R 36.22 3.94 65.63 79.64 87.27 444.39 2.17
TransE pub 89.2 251.
R 39.19 9.99 64.74 75.44 84.25 468.24 2.29
TransH pub 82.3 388.
R 0.18 0.04 0.19 0.29 0.39 19678.04 96.16
O 0.18 0.04 0.19 0.29 0.39 19678.02
P 0.18 0.04 0.19 0.29 0.39 19678.06
TransR pub 92. 225.
R 0.06 0.00 0.04 0.05 0.11 19686.49 96.20
O 0.06 0.00 0.04 0.05 0.11 19686.49
P 0.06 0.00 0.04 0.05 0.11 19686.50
TuckER pub 95.3 94.9 95.5 95.8
R 0.03 0.00 0.02 0.03 0.04 20622.46 100.78
TABLE 10
Reproduction of Studies on WN18RR without indicating the standard deviation where pub refers to published results, R to results based on the
realistic ranking, O to results based on the optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the
optimistic and pessimistic ranking in case they differ from the realistic ranking. Models highlighted in blue indicate the experiments are
soft-reproducible whereas models highlighted in red indicate the results represent extreme outliers. The full result table with standard deviation
across multiple runs can be found at Table 14.
MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ConvE pub 43. 4. 44. 52. 4187.
R 44.69 40.98 46.49 48.92 51.76 5369.49 26.49
ConvKB pub 24.8 52.5 2554.
R 0.30 0.09 0.21 0.32 0.57 13634.66 67.27
O 0.30 0.09 0.21 0.32 0.57 13634.65
P 0.30 0.09 0.21 0.32 0.57 13634.66
RotatE pub 47.6 42.8 49.2 57.1 3340.
R 48.40 44.02 50.55 52.98 56.51 4263.32 21.03
TABLE 11
Reproduction of Studies on FB15K where pub refers to published results, R to results based on the realistic ranking, O to results based on the
optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the optimistic and pessimistic ranking in case
they differ from the realistic ranking.
MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ComplEx pub 69.2 59.9 75.9 84.
R 19.13 ± 0.45 10.06 ± 0.42 20.82 ± 0.67 27.59 ± 0.94 38.03 ± 1.02 171.16 ± 16.05 2.46 ± 0.28
ConvE pub 65.7 55.8 72.3 83.1 51.
R 59.56 ± 0.06 48.28 ± 0.12 66.99 ± 0.04 73.27 ± 0.03 79.76 ± 0.07 50.76 ± 0.40 0.73 ± 0.01
DistMult pub 35. 57.7
R 26.06 ± 0.17 16.45 ± 0.16 29.10 ± 0.17 35.54 ± 0.21 45.00 ± 0.25 134.02 ± 1.98 1.86 ± 0.03
HolE pub 52.4 40.2 61.3 73.9
R 34.15 ± 0.22 21.79 ± 0.19 39.69 ± 0.24 48.06 ± 0.30 58.84 ± 0.28 193.03 ± 7.61 2.71 ± 0.12
KG2E pub 71.5 59.
R 0.58 ± 0.07 0.11 ± 0.04 0.36 ± 0.08 0.56 ± 0.10 1.01 ± 0.14 5779.07 ± 51.02 78.40 ± 0.68
RotatE pub 79.7 74.6 83. 88.4 40.
R 55.00 ± 0.06 41.53 ± 0.06 64.14 ± 0.07 71.23 ± 0.05 78.67 ± 0.08 42.28 ± 0.13 0.63 ± 0.00
SimplE pub 72.7 66. 77.3 83.8
R 0.04 ± 0.00 0.01 ± 0.00 0.03 ± 0.00 0.04 ± 0.00 0.06 ± 0.01 7395.75 ± 2.02 100.02 ± 0.03
O 23.90 ± 8.79 11.58 ± 6.42 24.16 ± 10.95 34.73 ± 13.40 54.28 ± 15.80 139.34 ± 49.45
P 0.03 ± 0.00 0.01 ± 0.00 0.03 ± 0.00 0.04 ± 0.00 0.06 ± 0.01 14652.16 ± 45.71
TransD pub 77.3 91.
R 33.99 ± 0.03 21.22 ± 0.03 40.48 ± 0.10 48.57 ± 0.09 58.71 ± 0.14 153.37 ± 5.35 2.29 ± 0.09
TransE pub 47.1 125.
R 26.01 ± 0.17 15.23 ± 0.16 29.85 ± 0.24 37.18 ± 0.24 47.34 ± 0.18 127.92 ± 0.86 1.78 ± 0.01
TransH pub 64.4 87.
R 2.54 ± 0.20 1.69 ± 0.25 2.95 ± 0.20 3.29 ± 0.22 3.74 ± 0.18 6320.02 ± 30.37 85.63 ± 0.40
O 2.54 ± 0.20 1.69 ± 0.25 2.95 ± 0.20 3.29 ± 0.22 3.74 ± 0.18 6320.00 ± 30.37
P 2.54 ± 0.20 1.69 ± 0.25 2.95 ± 0.20 3.29 ± 0.22 3.74 ± 0.18 6320.05 ± 30.37
TransR pub 68.7 77.
R 0.65 ± 0.02 0.37 ± 0.00 0.63 ± 0.04 0.78 ± 0.06 1.03 ± 0.07 6795.95 ± 16.65 91.99 ± 0.22
O 0.65 ± 0.02 0.37 ± 0.00 0.63 ± 0.04 0.78 ± 0.06 1.03 ± 0.07 6795.94 ± 16.65
P 0.65 ± 0.02 0.37 ± 0.00 0.63 ± 0.04 0.78 ± 0.06 1.03 ± 0.07 6795.95 ± 16.65
TuckER pub 79.5 74.1 83.3 89.2
R 0.07 ± 0.02 0.01 ± 0.00 0.02 ± 0.00 0.03 ± 0.01 0.15 ± 0.17 7327.77 ± 29.22 99.11 ± 0.39
TABLE 12
Reproduction of Studies on FB15K-237 where pub refers to published results, R to results based on the realistic ranking, O to results based on
the optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the optimistic and pessimistic ranking in case
they differ from the realistic ranking.
MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ConvE pub 32.5 23.7 35.6 50.1 244.
R 26.93 ± 0.11 18.22 ± 0.11 29.51 ± 0.24 35.98 ± 0.16 44.95 ± 0.17 255.46 ± 6.16 3.73 ± 0.13
ConvKB pub 39.6 51.7 257.
R 4.71 ± 0.23 3.31 ± 0.23 4.04 ± 0.19 4.57 ± 0.22 7.76 ± 0.88 4345.27 ± 46.99 61.36 ± 0.65
RotatE pub 33.8 24.1 37.5 53.3 177.
R 26.42 ± 0.04 17.57 ± 0.06 28.97 ± 0.05 35.29 ± 0.09 44.55 ± 0.06 191.92 ± 0.31 2.84 ± 0.00
TABLE 13
Reproduction of Studies on WN18 where pub refers to published results, R to results based on the realistic ranking, O to results based on the
optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the optimistic and pessimistic ranking in case
they differ from the realistic ranking.
MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ComplEx pub 94.1 93.6 94.5 94.7
R 19.49 ± 2.55 12.36 ± 1.96 20.66 ± 2.75 25.24 ± 3.33 32.92 ± 4.40 452.67 ± 63.05 2.21 ± 0.31
ConvE pub 94.3 93.5 94.6 95.6 374.
R 88.81 ± 0.09 85.14 ± 0.10 91.76 ± 0.11 93.29 ± 0.04 94.85 ± 0.06 444.40 ± 14.82 2.17 ± 0.07
DistMult pub 83. 94.2
R 77.44 ± 0.22 67.45 ± 0.34 85.94 ± 0.21 89.52 ± 0.25 92.72 ± 0.18 458.64 ± 23.96 2.24 ± 0.12
HolE pub 93.8 93. 94.5 94.9
R 70.44 ± 0.45 59.29 ± 0.53 79.29 ± 0.47 84.12 ± 0.36 88.61 ± 0.42 812.64 ± 28.33 3.97 ± 0.14
O 70.44 ± 0.45 59.29 ± 0.53 79.29 ± 0.47 84.12 ± 0.36 88.61 ± 0.42 812.63 ± 28.33
P 70.44 ± 0.45 59.29 ± 0.53 79.29 ± 0.47 84.12 ± 0.36 88.61 ± 0.42 812.64 ± 28.34
KG2E pub 92.8 331.
R 3.61 ± 0.26 1.35 ± 0.22 3.21 ± 0.31 4.57 ± 0.34 7.02 ± 0.43 2708.89 ± 44.57 13.25 ± 0.22
O 3.61 ± 0.26 1.35 ± 0.22 3.21 ± 0.31 4.57 ± 0.34 7.02 ± 0.43 2708.88 ± 44.57
P 3.61 ± 0.26 1.35 ± 0.22 3.21 ± 0.31 4.57 ± 0.34 7.02 ± 0.43 2708.89 ± 44.57
RotatE pub 94.9 94.4 95.2 95.9 309.
R 87.29 ± 0.12 82.17 ± 0.20 91.53 ± 0.12 93.44 ± 0.07 95.28 ± 0.08 123.68 ± 1.71 0.61 ± 0.01
SimplE pub 94.2 93.9 94.4 94.7
R 0.04 ± 0.01 0.01 ± 0.01 0.03 ± 0.00 0.04 ± 0.01 0.07 ± 0.03 20376.43 ± 42.30 99.57 ± 0.21
O 38.48 ± 4.00 33.93 ± 4.32 39.59 ± 4.67 42.76 ± 3.73 47.01 ± 2.66 384.53 ± 66.45
P 0.03 ± 0.01 0.01 ± 0.01 0.03 ± 0.00 0.04 ± 0.01 0.07 ± 0.03 40368.33 ± 114.95
TransD pub 92.2 212.
R 36.22 ± 0.12 3.94 ± 0.27 65.63 ± 0.55 79.64 ± 0.43 87.27 ± 0.41 444.39 ± 25.61 2.17 ± 0.13
TransE pub 89.2 251.
R 39.19 ± 1.21 9.99 ± 1.82 64.74 ± 0.91 75.44 ± 0.48 84.25 ± 0.33 468.24 ± 13.64 2.29 ± 0.07
TransH pub 82.3 388.
R 0.18 ± 0.04 0.04 ± 0.02 0.19 ± 0.10 0.29 ± 0.14 0.39 ± 0.11 19678.04 ± 18.92 96.16 ± 0.09
O 0.18 ± 0.04 0.04 ± 0.02 0.19 ± 0.10 0.29 ± 0.14 0.39 ± 0.11 19678.02 ± 18.92
P 0.18 ± 0.04 0.04 ± 0.02 0.19 ± 0.10 0.29 ± 0.14 0.39 ± 0.11 19678.06 ± 18.92
TransR pub 92. 225.
R 0.06 ± 0.02 0.00 ± 0.00 0.04 ± 0.02 0.05 ± 0.03 0.11 ± 0.06 19686.49 ± 100.97 96.20 ± 0.49
O 0.06 ± 0.02 0.00 ± 0.00 0.04 ± 0.02 0.05 ± 0.03 0.11 ± 0.06 19686.49 ± 100.97
P 0.06 ± 0.02 0.00 ± 0.00 0.04 ± 0.02 0.05 ± 0.03 0.11 ± 0.06 19686.50 ± 100.97
TuckER pub 95.3 94.9 95.5 95.8
R 0.03 ± 0.01 0.00 ± 0.00 0.02 ± 0.01 0.03 ± 0.03 0.04 ± 0.03 20622.46 ± 153.52 100.78 ± 0.75
TABLE 14
Reproduction of Studies on WN18RR where pub refers to published results, R to results based on the realistic ranking, O to results based on the
optimistic ranking, and P to results based on the pessimistic ranking. We only show the results of the optimistic and pessimistic ranking in case
they differ from the realistic ranking.
MRR (%) Hits@1 (%) Hits@3 (%) Hits@5 (%) Hits@10 (%) MR AMR (%)
Model
ConvE pub 43. 4. 44. 52. 4187.
R 44.69 ± 0.21 40.98 ± 0.22 46.49 ± 0.14 48.92 ± 0.23 51.76 ± 0.13 5369.49 ± 50.92 26.49 ± 0.25
ConvKB pub 24.8 52.5 2554.
R 0.30 ± 0.07 0.09 ± 0.03 0.21 ± 0.07 0.32 ± 0.10 0.57 ± 0.16 13634.66 ± 714.24 67.27 ± 3.52
O 0.30 ± 0.07 0.09 ± 0.03 0.21 ± 0.07 0.32 ± 0.10 0.57 ± 0.16 13634.65 ± 714.24
P 0.30 ± 0.07 0.09 ± 0.03 0.21 ± 0.07 0.32 ± 0.10 0.57 ± 0.16 13634.66 ± 714.24
RotatE pub 47.6 42.8 49.2 57.1 3340.
R 48.40 ± 0.09 44.02 ± 0.15 50.55 ± 0.12 52.98 ± 0.11 56.51 ± 0.26 4263.32 ± 90.33 21.03 ± 0.45
TABLE 15
Model sizes in bytes for the best reported configurations in the reproducibility study.
TABLE 16
Pareto-optimal models for FB15K-237 regarding Model Bytes and Hits@10
Model Loss Training Approach Inverse Relations Model Bytes Hits@10 (%)
RotatE NSSAL sLCWA no 28.8 MiB 48.060
RotatE NSSAL sLCWA yes 14.6 MiB 47.742
DistMult CEL LCWA yes 3.7 MiB 46.032
TransE SPL LCWA no 3.6 MiB 43.722
UM MRL sLCWA yes 3.5 MiB 3.166
UM MRL sLCWA no 3.5 MiB 3.075
TABLE 17
Pareto-optimal models for Kinships regarding Model Bytes and Hits@10
Model Loss Training Approach Inverse Relations Model Bytes Hits@10 (%)
RotatE MRL sLCWA no 258.0 KiB 98.743
RotatE NSSAL sLCWA yes 77.0 KiB 98.417
ProjE NSSAL sLCWA yes 39.3 KiB 95.438
ProjE SPL sLCWA no 33.0 KiB 94.088
HolE SPL LCWA no 32.2 KiB 87.384
UM SPL LCWA no 26.0 KiB 10.335
UM MRL sLCWA no 26.0 KiB 6.844
TABLE 18
Pareto-optimal models for WN18RR regarding Model Bytes and Hits@10
Model Loss Training Approach Inverse Relations Model Bytes Hits@10 (%)
RotatE SPL LCWA yes 79.3 MiB 60.089
RotatE BCEL LCWA yes 19.8 MiB 59.337
ConvE SPL LCWA yes 10.6 MiB 53.129
ConvKB SPL sLCWA yes 5.0 MiB 41.365
ConvKB NSSAL sLCWA no 2.5 MiB 20.332
TABLE 19
Pareto-optimal models for YAGO3-10 regarding Model Bytes and Hits@10
Model Loss Training Approach Inverse Relations Training Time Hits@10 (%)
RotatE NSSAL sLCWA yes 1.96 h 62.595
ComplEx SPL sLCWA no 53.89 min 62.224
ComplEx BCEL sLCWA no 37.45 min 59.916
TransE MRL sLCWA no 12.26 min 40.837
TransE SPL sLCWA no 11.03 min 16.128
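
The Pareto-optimal rows in Tables 16–19 are those configurations for which no other configuration is at least as small and strictly better in Hits@10 (or strictly smaller and at least as good). A small sketch of extracting such a front, with illustrative values adapted from Table 17, is given below.

    def pareto_front(configs):
        """Return configurations not dominated in (bytes, hits10).

        `configs` is a list of dicts with keys 'name', 'bytes', and 'hits10';
        smaller 'bytes' and larger 'hits10' are preferred.
        """
        front = []
        for c in configs:
            dominated = any(
                (o["bytes"] <= c["bytes"] and o["hits10"] > c["hits10"])
                or (o["bytes"] < c["bytes"] and o["hits10"] >= c["hits10"])
                for o in configs
            )
            if not dominated:
                front.append(c)
        return sorted(front, key=lambda c: c["bytes"])

    # Illustrative values taken from Table 17 (Kinships):
    configs = [
        {"name": "RotatE/MRL/sLCWA", "bytes": 258_000, "hits10": 98.743},
        {"name": "ProjE/SPL/sLCWA", "bytes": 33_000, "hits10": 94.088},
        {"name": "HolE/SPL/LCWA", "bytes": 32_200, "hits10": 87.384},
    ]
    print([c["name"] for c in pareto_front(configs)])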
TABLE 20
Best configuration for each model in FB15K-237
TABLE 21
Best configuration for each model in Kinships
TABLE 22
Best configuration for each model in WN18RR
TABLE 23
Best configuration for each model in YAGO3-10
(Fig. 26 panels: FB15K-237, Kinships, WN18RR, YAGO3-10; x-axis: Model Bytes, log scale; y-axis: hits@10; legend models: ComplEx, ConvE, ConvKB, DistMult, ERMLP, HolE, ProjE, RotatE, TransE, UM.)
Fig. 26. Scatter plots comparing model size in number of bytes and model performance in terms of Hits@10 for all trained models on each dataset.
The color indicates the model type, and the model size is shown on a logarithmic axis. Pareto-optimal models are highlighted by cross symbols. In
general, we only see a low correlation between model size and performance. A more thorough comparison can be found in Figures 4, 10, 16, and
22.
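
For shallow embedding models, the model sizes on the x-axis of Figure 26 are dominated by the entity and relation embedding tables. The following back-of-the-envelope sketch (an assumption, not the exact accounting used in our study) estimates the size of such a model stored in 32-bit floats.

    def embedding_model_bytes(num_entities: int, num_relations: int,
                              entity_dim: int, relation_dim: int,
                              bytes_per_param: int = 4) -> int:
        """Rough size of a shallow KGEM: entity and relation embedding tables in float32."""
        entity_params = num_entities * entity_dim
        relation_params = num_relations * relation_dim
        return (entity_params + relation_params) * bytes_per_param

    # Example: a TransE-like model on a FB15K-237-sized graph
    # (~14,541 entities, 237 relations) with 64-dimensional embeddings.
    print(embedding_model_bytes(14_541, 237, 64, 64) / 2**20, "MiB")

For this example, the estimate is roughly 3.6 MiB, which is consistent with the TransE entry in Table 16.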
inverse_relations = False
KG2E 10.06% 64.29% 25.09% KG2E 10.66% 88.87% 89.48% 48.74%
NTN 10.20% 9.87% 10.01% NTN 11.50% 10.75% 9.17% 9.92%
ProjE 17.46% 81.42% 16.20% ProjE 85.20% 85.06% 91.62% 86.59%
RESCAL 31.10% 90.76% 88.97% RESCAL 86.45% 93.11%
RotatE 11.31% 96.93% 12.38% RotatE 90.18% 97.58% 95.76% 88.08%
SE 94.83% 93.48% 95.34% SE 96.79%
SimplE 53.35% 93.95% 46.00% SimplE 92.69% 89.90% 93.30% 93.90%
TransD 23.60% 37.29% 22.07% TransD 31.89% 35.34% 24.58% 23.42%
TransE 69.55% 82.26% 60.01% TransE 67.64% 85.43% 87.38% 68.99%
TransR 55.26% 61.41% 51.02% TransR 68.67% 69.60% 63.13% 69.18%
TuckER 94.65% 96.07% 94.41% TuckER 96.65% 97.72%
UM 9.87% 9.68% 9.17% UM 10.94% 4.70% 6.56% 8.19%
BCEL CEL SPL BCEL MRL NSSAL SPL
ComplEx 84.31% 97.49% 83.99% ComplEx 97.63% 96.32% 97.86% 97.44%
ConvE 97.63% 98.04% 97.25% ConvE 96.65% 96.83% 98.28% 97.25%
ConvKB 23.09% 84.78% 21.28% ConvKB 91.11% 95.34% 89.34% 83.57%
DistMult 10.66% 85.66% 10.94% DistMult 55.87% 86.36% 55.17% 55.45%
ERMLP 84.45% 89.76% 87.24% ERMLP 93.30% 94.09% 95.34% 94.18%
HolE 86.92% 92.78% 87.24% HolE 85.89% 86.03% 86.31% 86.13%
inverse_relations = True
KG2E 9.73% 88.08% 32.59% KG2E 8.85% 91.06% 89.57% 82.87%
NTN 11.17% 9.40% 10.24% NTN 10.80% 10.06% 10.20% 9.31%
ProjE 26.26% 92.09% 21.32% ProjE 88.73% 91.57% 92.55% 90.83%
RESCAL 92.74% 94.41% 93.30% RESCAL 92.97% 96.42%
RotatE 10.89% 97.21% 11.41% RotatE 87.15% 97.25% 96.46% 85.47%
SE 96.18% 97.39% 96.23% SE 97.11% 98.09% 96.60%
SimplE 78.49% 96.09% 69.32% SimplE 94.79% 92.88% 94.04% 95.30%
TransD 23.00% 43.30% 26.72% TransD 42.27% 43.25% 26.44% 23.60%
TransE 74.16% 91.29% 65.60% TransE 67.04% 85.57% 86.82% 68.11%
TransR 54.00% 68.62% 54.00% TransR 57.31% 70.48% 67.27% 65.41%
TuckER 97.67% 97.65% 97.30% TuckER 97.91% 98.32% 91.95% 97.49%
UM 9.78% 7.50% 9.08% UM 8.99% 5.17% 4.98% 8.43%
BCEL CEL SPL BCEL MRL NSSAL SPL
Fig. 27. Results for all configurations on Kinships based on Adadelta. BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss,
MRL to the margin ranking loss, NSSAL refers to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed
world assumption training approach and sLCWA to the stochastic local closed world assumption training approach.
inverse_relations = False
KG2E 9.87% 16.11% 15.69% KG2E 10.71% 88.45% 21.93% 14.15%
NTN 81.66% 88.27% 86.17% NTN 82.26% 86.17% 58.24% 88.59%
ProjE 91.90% 91.01% 87.15% ProjE 94.65% 91.53% 92.27% 94.09%
RESCAL 54.10% 95.95% 83.66% RESCAL 92.60% 91.06% 95.30% 97.58%
RotatE 98.23% 98.14% 90.13% RotatE 93.99% 98.74% 98.51% 93.62%
SE 93.58% 94.55% 94.27% SE 96.00% 96.74% 96.14% 96.74%
SimplE 95.53% 96.69% 92.09% SimplE 97.91% 95.62% 98.00% 97.67%
TransD 34.87% 40.27% 21.51% TransD 40.32% 40.69% 31.38% 35.99%
TransE 62.38% 79.10% 66.85% TransE 69.23% 77.98% 87.62% 67.04%
TransH 27.05% 45.39% 27.70% TransH 41.67% 46.55% 47.86% 38.45%
TransR 62.34% 64.99% 62.24% TransR 69.37% 70.34% 67.78% 71.55%
TuckER 97.72% 97.30% 97.25% TuckER 97.72% 97.77% 97.11%
UM 9.08% 9.22% 10.34% UM 9.78% 6.84% 9.03% 8.52%
BCEL CEL SPL BCEL MRL NSSAL SPL
ComplEx 96.88% 97.58% 96.09% ComplEx 97.91% 94.51% 97.30% 97.81%
ConvE 98.14% 97.72% 97.91% ConvE 94.93% 93.34% 97.35% 94.69%
ConvKB 94.74% 95.25% 95.62% ConvKB 95.07% 95.44% 96.97% 96.74%
DistMult 86.31% 92.27% 84.17% DistMult 86.78% 86.13% 84.82% 86.59%
ERMLP 93.85% 95.25% 94.74% ERMLP 96.74% 95.95% 96.88% 96.37%
HolE 88.27% 92.50% 88.22% HolE 87.29% 86.22% 86.78% 86.31%
inverse_relations = True
KG2E 11.55% 18.95% 13.83% KG2E 9.68% 89.76% 71.14% 56.52%
NTN 88.18% 88.92% 86.08% NTN 92.18% 87.85% 71.18% 20.44%
ProjE 94.74% 93.72% 91.85% ProjE 95.20% 94.37% 95.44% 95.16%
RESCAL 86.87% 96.83% 89.25% RESCAL 87.94% 89.43% 94.93% 97.07%
RotatE 95.07% 98.23% 92.23% RotatE 93.44% 98.65% 98.42% 93.39%
SE 96.42% 97.35% 96.09% SE 96.60% 96.55% 96.14% 96.55%
SimplE 98.09% 97.91% 93.72% SimplE 98.14% 94.79% 96.79% 97.77%
TransD 33.85% 45.58% 38.08% TransD 33.57% 40.88% 37.06% 32.45%
TransE 69.74% 86.36% 62.57% TransE 66.62% 85.85% 80.96% 67.83%
TransH 35.71% 52.37% 35.52% TransH 33.43% 46.97% 30.07% 34.12%
TransR 69.83% 70.86% 70.20% TransR 70.53% 70.44% 68.48% 70.02%
TuckER 98.51% 98.14% 98.46% TuckER 98.14% 94.74% 97.72% 98.09%
UM 8.71% 9.36% 10.75% UM 8.99% 6.28% 8.47% 10.20%
BCEL CEL SPL BCEL MRL NSSAL SPL
Fig. 28. Results for all configurations on Kinships based on Adam. BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss,
MRL to the margin ranking loss, NSSAL refers to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed
world assumption training approach and sLCWA to the stochastic local closed world assumption training approach.
inverse_relations = False
KG2E 0.07% 42.68% 50.72% KG2E 0.00% 44.20% 46.02% 46.39%
NTN 1.54% NTN 1.04% 31.34% 0.14%
ProjE 44.48% 48.05% 36.56% ProjE 41.74% 40.53% 41.16% 42.13%
RESCAL 39.88% 53.04% 38.77% RESCAL 38.87% 35.98% 42.54% 40.94%
RotatE 58.14% 52.68% 57.78% RotatE 51.25% 51.20% 58.26% 51.28%
SE 10.64% SE 41.86% 45.38%
SimplE 45.37% 49.59% 38.17% SimplE 42.32% 38.37% 40.58% 42.41%
TransD 0.05% 14.83% 0.24% TransD 44.12% 46.31% 37.14% 42.73%
TransE 3.92% 33.94% 56.98% TransE 43.59% 49.15% 49.61% 42.80%
TransH 2.07% 13.56% 1.09% TransH 25.87% 47.81% 4.75% 38.10%
TransR 1.18% 14.45% 0.12% TransR 35.11% 42.53% 33.72% 35.69%
TuckER 49.97% 51.37% 48.77% TuckER 46.31% 45.09%
UM 44.70% 41.18% 45.04% UM 39.74% 39.33% 42.00% 39.79%
BCEL CEL SPL BCEL MRL NSSAL SPL
ComplEx 44.61% 14.71% 36.25% ComplEx 42.42% 42.00% 40.73% 41.81%
ConvE 53.47% 56.10% 53.13% ConvE 39.18% 0.72% 40.58% 31.12%
ConvKB ConvKB 40.49% 39.86% 46.87% 41.36%
DistMult 46.10% 52.96% 44.17% DistMult 50.19% 45.73% 48.56% 45.45%
ERMLP 42.60% 46.36% 41.46% ERMLP 47.52% 42.68% 45.33% 46.87%
HolE 44.96% 30.04% 33.04% HolE 47.42% 39.53% 42.53% 45.28%
inverse_relations = True
KG2E 0.02% 47.93% 50.39% KG2E 0.03% 42.41% 43.84% 45.64%
NTN 0.00% NTN 24.81% 23.85% 0.00%
ProjE 45.81% 51.74% 42.48% ProjE 42.80% 42.82% 43.45% 40.60%
RESCAL 35.98% 53.20% 40.92% RESCAL 38.92% 37.45% 41.79% 42.15%
RotatE 59.34% 54.96% 60.09% RotatE 49.15% 47.69% 57.93% 49.83%
SE 2.09% SE 41.42% 41.30%
SimplE 44.36% 50.92% 40.24% SimplE 41.38% 38.51% 39.26% 42.51%
TransD 0.10% 28.81% 0.07% TransD 44.65% 44.89% 33.16% 40.83%
TransE 56.40% 46.24% 23.89% TransE 44.29% 48.91% 50.02% 42.34%
TransH 1.42% 12.82% 1.37% TransH 34.32% 37.19% 4.74% 3.57%
TransR 0.07% 38.46% 0.14% TransR 32.85% 10.17% 28.54% 34.49%
TuckER 52.70% 54.26% 51.11% TuckER 50.67% 43.38%
UM 43.83% 42.15% 44.32% UM 39.69% 39.50% 41.48% 39.55%
BCEL CEL SPL BCEL MRL NSSAL SPL
Fig. 29. Results for all configurations on WN18RR based on Adam. BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss,
MRL to the margin ranking loss, NSSAL refers to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed
world assumption training approach and sLCWA to the stochastic local closed world assumption training approach.
inverse_relations = False
NTN 10.40% 18.78% NTN 2.65% 9.87% 0.05% 19.60%
ProjE 30.58% 28.58% 23.83% ProjE 28.36% 31.51% 25.00% 27.85%
RESCAL 26.36% 35.73% 20.35% RESCAL 24.64% 32.95% 31.24% 33.03%
RotatE 45.79% 41.80% 43.74% RotatE 35.98% 42.21% 48.06% 36.56%
SimplE 24.13% 34.73% 20.60% SimplE 27.48% 25.94% 29.64% 28.09%
TransD 3.72% 25.29% 0.72% TransD 27.71% 32.57% 24.18% 28.85%
TransE 42.95% 33.22% 43.72% TransE 29.37% 43.91% 40.66% 33.62%
TransH 16.15% 21.48% 19.96% TransH 26.55% 33.80% 25.17% 24.96%
TransR 10.15% 21.23% 9.82% TransR 34.65% 7.35%
TuckER 33.64% 40.71% 37.51% TuckER
UM 6.33% 8.00% 6.84% UM 6.01% 3.08% 5.59% 6.03%
BCEL CEL SPL BCEL MRL NSSAL SPL
ComplEx 32.88% 42.95% 27.31% ComplEx 30.78% 31.09% 28.26% 26.94%
DistMult 38.60% 46.03% 37.71% DistMult 31.51% 36.98% 33.72% 33.03%
ERMLP 43.10% 39.76% 38.52% ERMLP 33.90% 36.44% 30.52% 35.45%
HolE 37.05% 40.66% 36.28% HolE 33.62% 30.02% 30.54% 35.16%
KG2E 0.03% 36.59% 43.16% KG2E 0.05% 34.78% 32.54% 38.43%
inverse_relations = True
NTN 16.17% 1.30% NTN 10.15% 3.01% 3.18% 4.23%
ProjE 39.94% 40.15% 32.27% ProjE 24.01% 33.45% 23.84% 26.48%
RESCAL 43.73% 44.52% 31.92% RESCAL 27.59% 30.77% 25.73% 32.09%
RotatE 42.30% 45.71% 44.40% RotatE 35.54% 27.88% 47.74% 35.49%
SimplE 34.53% 38.67% 21.45% SimplE 25.27% 24.20% 25.62% 26.24%
TransD 2.58% 28.48% 2.80% TransD 24.22% 39.48% 24.81% 27.84%
TransE 38.18% 44.28% 41.36% TransE 29.01% 41.24% 38.61% 33.46%
TransH 26.47% 24.97% 26.81% TransH 24.15% 27.28% 22.23% 23.26%
TransR 16.06% 38.41% 12.02% TransR 32.26% 12.11%
TuckER 40.49% 43.61% 45.04% TuckER
UM 6.37% 7.88% 6.63% UM 5.89% 3.17% 5.67% 5.95%
BCEL CEL SPL BCEL MRL NSSAL SPL
Fig. 30. Results for all configurations on FB15K-237 based on Adam. BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss,
MRL to the margin ranking loss, NSSAL refers to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed
world assumption training approach and sLCWA to the stochastic local closed world assumption training approach.
training_approach = sLCWA
inverse_relations = False
ComplEx 59.92% 51.90% 60.71% 62.22%
ERMLP 56.74% 39.91% 51.92% 58.06%
(Fig. 32 panels: y-axis loss functions CEL, MRL, NSSAL, SPL per interaction model; x-axis hits@10; legend: training approach LCWA vs. sLCWA.)
Fig. 32. Impact of the training approach on the performance for a fixed interaction model and loss function for the Kinships dataset (results are
aggregated based on the medians). BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss, MRL to the margin ranking loss,
NSSAL refers to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed world assumption training approach
and sLCWA to the stochastic local closed world assumption training approach.
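
The two training approaches compared in Figures 32–34 differ in how negative examples are generated: under the LCWA, all entities not observed as a tail for a given (head, relation) pair are treated as negatives, whereas under the sLCWA only a few corrupted triples are sampled per positive. The following listing is a deliberately simplified sketch of this distinction; PyKEEN's actual training loops are more involved (batching, filtering, inverse relations).

    import random

    def lcwa_targets(kg, head_relation_pairs, all_entities):
        """LCWA: every entity not observed as a tail of (h, r) is labeled negative."""
        targets = {}
        for h, r in head_relation_pairs:
            positives = {t for (h2, r2, t) in kg if (h2, r2) == (h, r)}
            targets[(h, r)] = {e: (e in positives) for e in all_entities}  # dense 0/1 labels
        return targets

    def slcwa_batch(kg, all_entities, num_negatives=2):
        """sLCWA: corrupt each positive triple by sampling replacement heads/tails."""
        batch = []
        for h, r, t in kg:
            batch.append(((h, r, t), 1))
            for _ in range(num_negatives):
                if random.random() < 0.5:
                    batch.append(((random.choice(all_entities), r, t), 0))
                else:
                    batch.append(((h, r, random.choice(all_entities)), 0))
        return batch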
(Fig. 33 panels: y-axis loss functions SPL, MRL, NSSAL, CEL per interaction model; x-axis hits@10.)
Fig. 33. Impact of the training approach on the performance for a fixed interaction model and loss function for the WN18RR dataset (results are
aggregated based on the medians). BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss, MRL to the margin ranking loss,
NSSAL refers to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed world assumption training approach
and sLCWA to the stochastic local closed world assumption training approach.
(Fig. 34 panels: y-axis loss functions MRL, NSSAL, SPL, CEL per interaction model; x-axis hits@10.)
Fig. 34. Impact of the training approach on the performance for a fixed interaction model and loss function for the FB15K-237 dataset (results are
aggregated based on the medians). BCEL refers to the binary cross entropy loss, CEL to the cross entropy loss, MRL to the margin ranking loss,
NSSAL refers to the negative sampling self-adversarial loss, SPL to the softplus loss, LCWA to the local closed world assumption training approach
and sLCWA to the stochastic local closed world assumption training approach.