Low-Dimensional Hyperbolic Knowledge Graph Embeddings


Ines Chami1∗, Adva Wolf1, Da-Cheng Juan2, Frederic Sala1, Sujith Ravi3† and Christopher Ré1
1Stanford University   2Google Research   3Amazon Alexa
{chami,advaw,fredsala,chrismre}@cs.stanford.edu
dacheng@google.com   sravi@sravi.org

Abstract

Knowledge graph (KG) embeddings learn low-dimensional representations of entities and relations to predict missing facts. KGs often exhibit hierarchical and logical patterns which must be preserved in the embedding space. For hierarchical data, hyperbolic embedding methods have shown promise for high-fidelity and parsimonious representations. However, existing hyperbolic embedding methods do not account for the rich logical patterns in KGs. In this work, we introduce a class of hyperbolic KG embedding models that simultaneously capture hierarchical and logical patterns. Our approach combines hyperbolic reflections and rotations with attention to model complex relational patterns. Experimental results on standard KG benchmarks show that our method improves over previous Euclidean- and hyperbolic-based efforts by up to 6.1% in mean reciprocal rank (MRR) in low dimensions. Furthermore, we observe that different geometric transformations capture different types of relations while attention-based transformations generalize to multiple relations. In high dimensions, our approach yields new state-of-the-art MRRs of 49.6% on WN18RR and 57.7% on YAGO3-10.

Figure 1: A toy example showing how KGs can simultaneously exhibit hierarchies and logical patterns.

1 Introduction

Knowledge graphs (KGs), consisting of (head entity, relationship, tail entity) triples, are popular data structures for representing factual knowledge to be queried and used in downstream applications such as word sense disambiguation, question answering, and information extraction. Real-world KGs such as Yago (Suchanek et al., 2007) or Wordnet (Miller, 1995) are usually incomplete, so a common approach to predicting missing links in KGs is via embedding into vector spaces. Embedding methods learn representations of entities and relationships that preserve the information found in the graph, and have achieved promising results for many tasks.

Relations found in KGs have differing properties: for example, (Michelle Obama, married to, Barack Obama) is symmetric, whereas hypernym relations like (cat, specific type of, feline) are not (Figure 1). These distinctions present a challenge to embedding methods: preserving each type of behavior requires producing a different geometric pattern in the embedding space. One popular approach is to use extremely high-dimensional embeddings, which offer more flexibility for such patterns. However, given the large number of entities found in KGs, doing so yields very high memory costs.

For hierarchical data, hyperbolic geometry offers an exciting approach to learn low-dimensional embeddings while preserving latent hierarchies. Hyperbolic space can embed trees with arbitrarily low distortion in just two dimensions. Recent research has proposed embedding hierarchical graphs into these spaces instead of conventional Euclidean space (Nickel and Kiela, 2017; Sala et al., 2018). However, these works focus on embedding simpler graphs (e.g., weighted trees) and cannot express the diverse and complex relationships in KGs.

∗Work partially done during an internship at Google.
†Work done while at Google AI.
We propose a new hyperbolic embedding approach that captures such patterns to achieve the best of both worlds. Our proposed approach produces the parsimonious representations offered by hyperbolic space, especially suitable for hierarchical relations, and is effective even with low-dimensional embeddings. It also uses rich transformations to encode logical patterns in KGs, previously only defined in Euclidean space. To accomplish this, we (1) train hyperbolic embeddings with relation-specific curvatures to preserve multiple hierarchies in KGs; (2) parameterize hyperbolic isometries (distance-preserving operations) and leverage their geometric properties to capture relations' logical patterns, such as symmetry or anti-symmetry; and (3) use a notion of hyperbolic attention to combine geometric operators and capture multiple logical patterns.

We evaluate the performance of our approach, AttH, on the KG link prediction task using the standard WN18RR (Dettmers et al., 2018; Bordes et al., 2013), FB15k-237 (Toutanova and Chen, 2015) and YAGO3-10 (Mahdisoltani et al., 2013) benchmarks. (1) In low (32) dimensions, we improve over Euclidean-based models by up to 6.1% in the mean reciprocal rank (MRR) metric. In particular, we find that hierarchical relationships, such as WordNet's hypernym and member meronym, significantly benefit from hyperbolic space; we observe a 16% to 24% relative improvement versus Euclidean baselines. (2) We find that geometric properties of hyperbolic isometries directly map to logical properties of relationships. We study symmetric and anti-symmetric patterns and find that reflections capture symmetric relations while rotations capture anti-symmetry. (3) We show that attention-based transformations have the ability to generalize to multiple logical patterns. For instance, we observe that AttH recovers reflections for symmetric relations and rotations for anti-symmetric ones.

In high (500) dimensions, we find that both hyperbolic and Euclidean embeddings achieve similar performance, and our approach achieves new state-of-the-art (SotA) results, obtaining 49.6% MRR on WN18RR and 57.7% on YAGO3-10. Our experiments show that trainable curvature is critical to generalize hyperbolic embedding methods to high dimensions. Finally, we visualize embeddings learned in hyperbolic spaces and show that hyperbolic geometry effectively preserves hierarchies in KGs.

2 Related Work

Previous methods for KG embeddings also rely on geometric properties. Improvements have been obtained by exploiting either more sophisticated spaces (e.g., going from Euclidean to complex or hyperbolic space) or more sophisticated operations (e.g., from translations to isometries, or to learning graph neural networks). In contrast, our approach takes a step forward in both directions.

Euclidean embeddings  In the past decade, there has been a rich literature on Euclidean embeddings for KG representation learning. These include translation approaches (Bordes et al., 2013; Ji et al., 2015; Wang et al., 2014; Lin et al., 2015) or tensor factorization methods such as RESCAL (Nickel et al., 2011) or DistMult (Yang et al., 2015). While these methods are fairly simple and have few parameters, they fail to encode important logical properties (e.g., translations cannot encode symmetry).

Complex embeddings  Recently, there has been interest in learning embeddings in complex space, as in the ComplEx (Trouillon et al., 2016) and RotatE (Sun et al., 2019) models. RotatE learns rotations in complex space, which are very effective in capturing logical properties such as symmetry, anti-symmetry, composition or inversion. The recent QuatE model (Zhang et al., 2019) learns KG embeddings using quaternions. However, a downside is that these embeddings require very high-dimensional spaces, leading to high memory costs.

Deep neural networks  Another family of methods uses neural networks to produce KG embeddings. For instance, R-GCN (Schlichtkrull et al., 2018) extends graph neural networks to the multi-relational setting by adding a relation-specific aggregation step. ConvE and ConvKB (Dettmers et al., 2018; Nguyen et al., 2018) leverage the expressiveness of convolutional neural networks to learn entity embeddings and relation embeddings. More recently, the KBGAT (Nathani et al., 2019) and A2N (Bansal et al., 2019) models use graph attention networks for knowledge graph embeddings. A downside of these methods is that they are computationally expensive, as they usually require pre-trained KG embeddings as input for the neural network.
Hyperbolic embeddings  To the best of our knowledge, MuRP (Balažević et al., 2019) is the only method that learns KG embeddings in hyperbolic space in order to target hierarchical data. MuRP minimizes hyperbolic distances between a re-scaled version of the head entity embedding and a translation of the tail entity embedding. It achieves promising results using hyperbolic embeddings with fewer dimensions than its Euclidean analogues. However, MuRP is a translation model and fails to encode some logical properties of relationships. Furthermore, embeddings are learned in a hyperbolic space with fixed curvature, potentially leading to insufficient precision, and training relies on cumbersome Riemannian optimization. Instead, our proposed method leverages expressive hyperbolic isometries to simultaneously capture logical patterns and hierarchies. Furthermore, embeddings are learned using tangent space (i.e., Euclidean) optimization methods and trainable hyperbolic curvatures per relationship, avoiding precision errors that might arise when using a fixed curvature, and providing flexibility to encode multiple hierarchies.

3 Problem Formulation and Background

We describe the KG embedding problem setting and give some necessary background on hyperbolic geometry.

3.1 Knowledge graph embeddings

In the KG embedding problem, we are given a set of triples (h, r, t) ∈ E ⊆ V × R × V, where V and R are entity and relationship sets, respectively. The goal is to map entities v ∈ V to embeddings e_v ∈ U^{d_V} and relationships r ∈ R to embeddings r_r ∈ U^{d_R}, for some choice of space U (traditionally R), such that the KG structure is preserved.

Concretely, the data is split into E_Train and E_Test triples. Embeddings are learned by optimizing a scoring function s : V × R × V → R, which measures triples' likelihoods. s(·, ·, ·) is trained using triples in E_Train, and the learned embeddings are then used to predict scores for triples in E_Test. The goal is to learn embeddings such that the scores of triples in E_Test are high compared to triples that are not present in E.

3.2 Hyperbolic geometry

We briefly review key notions from hyperbolic geometry; a more in-depth treatment is available in standard texts (Robbin and Salamon). Hyperbolic geometry is a non-Euclidean geometry with constant negative curvature. In this work, we use the d-dimensional Poincaré ball model with negative curvature −c (c > 0): B^{d,c} = {x ∈ R^d : ||x||^2 < 1/c}, where ||·|| denotes the L2 norm. For each point x ∈ B^{d,c}, the tangent space T_x^c is a d-dimensional vector space containing all possible directions of paths in B^{d,c} leaving from x.

Figure 2: An illustration of the exponential map exp_x(v), which maps the tangent space T_x M at the point x to the hyperbolic manifold M.

The tangent space T_x^c maps to B^{d,c} via the exponential map (Figure 2), and conversely, the logarithmic map maps B^{d,c} to T_x^c. In particular, we have closed-form expressions for these maps at the origin:

    exp_0^c(v) = tanh(√c ||v||) v / (√c ||v||),    (1)
    log_0^c(y) = arctanh(√c ||y||) y / (√c ||y||).    (2)

Vector addition is not well-defined in hyperbolic space (adding two points in the Poincaré ball might result in a point outside the ball). Instead, Möbius addition ⊕^c (Ganea et al., 2018) provides an analogue to Euclidean addition for hyperbolic space. We give its closed-form expression in Appendix A.1. Finally, the hyperbolic distance on B^{d,c} has the explicit formula:

    d^c(x, y) = (2/√c) arctanh(√c ||−x ⊕^c y||).    (3)
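For concreteness, the following is a minimal NumPy sketch of the maps in Equations (1)-(3); the Möbius addition it relies on is the closed form given in Appendix A.1. Function names are illustrative and are not taken from the released implementation.

```python
import numpy as np

def expmap0(v, c):
    """Exponential map at the origin of the Poincare ball with curvature -c (Eq. 1)."""
    sqrt_c, norm = np.sqrt(c), np.linalg.norm(v)
    if norm == 0:
        return v
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def logmap0(y, c):
    """Logarithmic map at the origin (Eq. 2), the inverse of expmap0."""
    sqrt_c, norm = np.sqrt(c), np.linalg.norm(y)
    if norm == 0:
        return y
    return np.arctanh(sqrt_c * norm) * y / (sqrt_c * norm)

def mobius_add(x, y, c):
    """Mobius addition on the Poincare ball (closed form given in Appendix A.1)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def hyp_distance(x, y, c):
    """Hyperbolic distance between two points of the ball (Eq. 3)."""
    sqrt_c = np.sqrt(c)
    return 2.0 / sqrt_c * np.arctanh(sqrt_c * np.linalg.norm(mobius_add(-x, y, c)))
```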
4 Methodology

The goal of this work is to learn parsimonious hyperbolic embeddings that can encode complex logical patterns such as symmetry, anti-symmetry, or inversion while preserving latent hierarchies. Our model, AttH, (1) learns KG embeddings in hyperbolic space in order to preserve hierarchies (Section 4.1), (2) uses a class of hyperbolic isometries parameterized by compositions of Givens transformations to encode logical patterns (Section 4.2), and (3) combines these isometries with hyperbolic attention (Section 4.3). We describe the full model in Section 4.4.
4.1 Hierarchies in hyperbolic space

As described, hyperbolic embeddings enable us to represent hierarchies even when we limit ourselves to low-dimensional spaces. In fact, two-dimensional hyperbolic space can represent any tree with arbitrarily small error (Sala et al., 2018).

It is important to set the curvature of the hyperbolic space correctly. This parameter provides flexibility to the model, as it determines whether to embed relations into a more curved hyperbolic space (more "tree-like"), or into a flatter, more "Euclidean-like" geometry. For each relation, we learn a relation-specific absolute curvature c_r, enabling us to represent a variety of hierarchies. As we show in Section 5.5, fixing, rather than learning, curvatures can lead to significant performance degradation.

(a) Rotations  (b) Reflections
Figure 3: Euclidean (left) and hyperbolic (right) isometries. In hyperbolic space, the distance between start and end points after applying rotations or reflections is much larger than the Euclidean distance; it approaches the sum of the distances between the points and the origin, giving more "room" to separate embeddings. This is similar to trees, where the shortest path between two points goes through their nearest common ancestor.

4.2 Hyperbolic isometries

Relationships often satisfy particular properties, such as symmetry: e.g., if (Michelle Obama, married to, Barack Obama) holds, then (Barack Obama, married to, Michelle Obama) does as well. These rules are not universal. For instance, (Barack Obama, born in, Hawaii) is not symmetric.

Creating and curating a set of deterministic rules is infeasible for large-scale KGs; instead, embedding methods represent relations as parameterized geometric operations that directly map to logical properties. We use two such operations in hyperbolic space: rotations, which effectively capture compositions or anti-symmetric patterns, and reflections, which naturally encode symmetric patterns.

Rotations  Rotations have been successfully used to encode compositions in complex space with the RotatE model (Sun et al., 2019); we lift these to hyperbolic space. Compared to translations or tensor factorization approaches, which can only infer some logical patterns, rotations can simultaneously model and infer inversion, composition, symmetric or anti-symmetric patterns.

Reflections  These isometries reflect along a fixed subspace. While some rotations can represent symmetric relations (more specifically π-rotations), any reflection can naturally represent symmetric relations, since their second power is the identity. They provide a way to fill in missing entries in symmetric triples, by applying the same operation to both the tail and the head entity. For instance, by modelling sibling of with a reflection, we can directly infer (Bob, sibling of, Alice) from (Alice, sibling of, Bob) and vice versa.

Parameterization  Unlike RotatE, which models rotations via unitary complex numbers, we learn relationship-specific isometries using Givens transformations, 2 × 2 matrices commonly used in numerical linear algebra. Let Θ_r := (θ_{r,i})_{i ∈ {1,...,d/2}} and Φ_r := (φ_{r,i})_{i ∈ {1,...,d/2}} denote relation-specific parameters. Using an even number of dimensions d, our model parameterizes rotations and reflections with block-diagonal matrices of the form:

    Rot(Θ_r) = diag(G^+(θ_{r,1}), . . . , G^+(θ_{r,d/2})),    (4)
    Ref(Φ_r) = diag(G^−(φ_{r,1}), . . . , G^−(φ_{r,d/2})),    (5)
    where G^±(θ) := [ cos(θ)  ∓sin(θ) ; sin(θ)  ±cos(θ) ].    (6)

Rotations and reflections of this form are hyperbolic isometries (distance-preserving). We can therefore directly apply them to hyperbolic embeddings while preserving the underlying geometry. Additionally, these transformations are computationally efficient and can be computed in linear time in the dimension. We illustrate two-dimensional isometries in both Euclidean and hyperbolic spaces in Figure 3.
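As an illustration of Equations (4)-(6), here is a minimal NumPy sketch that applies the block-diagonal Givens rotations and reflections without ever materializing the full d × d matrix (assuming an even dimension d); the function names are illustrative only, not those of the released code.

```python
import numpy as np

def givens_rotation(thetas, x):
    """Apply the block-diagonal rotation Rot(Theta_r) of Eq. (4) to a vector x.

    Each consecutive pair of coordinates is rotated by its own angle, so the cost
    is linear in the (even) dimension d.
    """
    pairs = x.reshape(-1, 2)
    cos, sin = np.cos(thetas), np.sin(thetas)
    rotated = np.stack([cos * pairs[:, 0] - sin * pairs[:, 1],
                        sin * pairs[:, 0] + cos * pairs[:, 1]], axis=-1)
    return rotated.reshape(-1)

def givens_reflection(phis, x):
    """Apply the block-diagonal reflection Ref(Phi_r) of Eq. (5) to a vector x.

    Each 2 x 2 block G^-(phi) squares to the identity, which is what makes
    reflections a natural fit for symmetric relations.
    """
    pairs = x.reshape(-1, 2)
    cos, sin = np.cos(phis), np.sin(phis)
    reflected = np.stack([cos * pairs[:, 0] + sin * pairs[:, 1],
                          sin * pairs[:, 0] - cos * pairs[:, 1]], axis=-1)
    return reflected.reshape(-1)
```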
4.3 Hyperbolic attention

Of our two classes of hyperbolic isometries, one or the other may better represent a particular relation. To handle this, we use an attention mechanism to learn the right isometry. Thus we can represent symmetric, anti-symmetric or mixed-behaviour relations (i.e., neither symmetric nor anti-symmetric) as a combination of rotations and reflections.

Let x^H and y^H be hyperbolic points (e.g., reflection and rotation embeddings), and a be an attention vector. Our approach maps hyperbolic representations to tangent space representations, x^E = log_0^c(x^H) and y^E = log_0^c(y^H), and computes attention scores:

    (α_x, α_y) = Softmax(a^T x^E, a^T y^E).

We then compute a weighted average using the recently proposed tangent space average (Chami et al., 2019; Liu et al., 2019):

    Att(x^H, y^H; a) := exp_0^c(α_x x^E + α_y y^E).    (7)

4.4 The AttH model

We have all of the building blocks for AttH, and can now describe the model architecture. Let (e_v^H)_{v∈V} and (r_r^H)_{r∈R} denote entity and relationship hyperbolic embeddings, respectively. For a triple (h, r, t) ∈ V × R × V, AttH applies relation-specific rotations (Equation 4) and reflections (Equation 5) to the head embedding:

    q_Rot^H = Rot(Θ_r) e_h^H,   q_Ref^H = Ref(Φ_r) e_h^H.    (8)

AttH then combines the two representations using hyperbolic attention (Equation 7) and applies a hyperbolic translation:

    Q(h, r) = Att(q_Rot^H, q_Ref^H; a_r) ⊕^{c_r} r_r^H.    (9)

Intuitively, rotations and reflections encode logical patterns while translations capture tree-like structures by moving between levels of the hierarchy. Finally, query embeddings are compared to target tail embeddings via the hyperbolic distance (Equation 3). The resulting scoring function is:

    s(h, r, t) = −d^{c_r}(Q(h, r), e_t^H)^2 + b_h + b_t,    (10)

where (b_v)_{v∈V} are entity biases which act as margins in the scoring function (Tifrea et al., 2019; Balažević et al., 2019).

The model parameters are then {(Θ_r, Φ_r, r_r^H, a_r, c_r)_{r∈R}, (e_v^H, b_v)_{v∈V}}. Note that the total number of parameters in AttH is O(|V|d), similar to traditional models that do not use attention or geometric operations. The extra cost is proportional to the number of relations, which is usually much smaller than the number of entities.
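Putting Sections 4.2-4.4 together, the sketch below scores a single triple, reusing the exponential/logarithmic maps, Möbius addition and Givens isometries sketched earlier. It is a simplified illustration of Equations (7)-(10), not the released implementation, and the function names are assumptions.

```python
import numpy as np

def hyp_attention(x_h, y_h, a, c):
    """Tangent-space attention over two hyperbolic points (Eq. 7)."""
    x_e, y_e = logmap0(x_h, c), logmap0(y_h, c)          # map to the tangent space at 0
    scores = np.array([a @ x_e, a @ y_e])
    alpha = np.exp(scores - scores.max())                # numerically stable softmax
    alpha = alpha / alpha.sum()
    return expmap0(alpha[0] * x_e + alpha[1] * y_e, c)   # map the average back to the ball

def atth_score(e_h, e_t, thetas, phis, r_h, a, c, b_h, b_t):
    """AttH scoring function (Eqs. 8-10) for one triple (h, r, t)."""
    q_rot = givens_rotation(thetas, e_h)                 # rotation branch of Eq. (8)
    q_ref = givens_reflection(phis, e_h)                 # reflection branch of Eq. (8)
    q = hyp_attention(q_rot, q_ref, a, c)                # combine with attention, Eq. (7)
    q = mobius_add(q, r_h, c)                            # hyperbolic translation, Eq. (9)
    return -hyp_distance(q, e_t, c) ** 2 + b_h + b_t     # squared-distance score, Eq. (10)
```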
    Dataset     #entities  #relations  #triples  ξG
    WN18RR      41k        11          93k       -2.54
    FB15k-237   15k        237         310k      -0.65
    YAGO3-10    123k       37          1M        -0.54

Table 1: Dataset statistics. The lower the metric ξG is, the more tree-like the knowledge graph is.

5 Experiments

In low dimensions, we hypothesize (1) that hyperbolic embedding methods obtain better representations and allow for improved downstream performance for hierarchical data (Section 5.2). (2) We expect the performance of relation-specific geometric operations to vary based on the relation's logical patterns (Section 5.3). (3) In cases where relations are neither purely symmetric nor anti-symmetric, we anticipate that hyperbolic attention outperforms models based solely on reflections or rotations (Section 5.4). Finally, in high dimensions, we expect hyperbolic models with trainable curvature to learn the best geometry, and to perform similarly to their Euclidean analogues (Section 5.5).

5.1 Experimental setup

Datasets  We evaluate our approach on the link prediction task using three standard competition benchmarks, namely WN18RR (Bordes et al., 2013; Dettmers et al., 2018), FB15k-237 (Bordes et al., 2013; Toutanova and Chen, 2015) and YAGO3-10 (Mahdisoltani et al., 2013). WN18RR is a subset of WordNet containing 11 lexical relationships between 40,943 word senses, and has a natural hierarchical structure, e.g., (car, hypernym of, sedan). FB15k-237 is a subset of Freebase, a collaborative KB of general world knowledge. FB15k-237 has 14,541 entities and 237 relationships, some of which are non-hierarchical, such as born-in or nationality, while others have natural hierarchies, such as part-of (for organizations). YAGO3-10 is a subset of YAGO3, containing 123,182 entities and 37 relations, where most relations provide descriptions of people. Some relationships have a hierarchical structure, such as playsFor or actedIn, while others induce logical patterns, like isMarriedTo.

For each KG, we follow the standard data augmentation protocol by adding inverse relations (Lacroix et al., 2018) to the datasets. Additionally, we estimate the global graph curvature ξG (Gu et al., 2019) (see Appendix A.2 for more details), which is a distance-based measure of how close a given graph is to being a tree. We summarize the datasets' statistics in Table 1.

    U        Model        WN18RR                   FB15k-237                YAGO3-10
                          MRR  H@1  H@3  H@10      MRR  H@1  H@3  H@10      MRR  H@1  H@3  H@10
    R^d      RotatE       .387 .330 .417 .491      .290 .208 .316 .458      -    -    -    -
             MuRE         .458 .421 .471 .525      .313 .226 .340 .489      .283 .187 .317 .478
    C^d      ComplEx-N3   .420 .390 .420 .460      .294 .211 .322 .463      .336 .259 .367 .484
    B^{d,1}  MuRP         .465 .420 .484 .544      .323 .235 .353 .501      .230 .150 .247 .392
    R^d      RefE         .455 .419 .470 .521      .302 .216 .330 .474      .370 .289 .403 .527
             RotE         .463 .426 .477 .529      .307 .220 .337 .482      .381 .295 .417 .548
             AttE         .456 .419 .471 .526      .311 .223 .339 .488      .374 .290 .410 .537
    B^{d,c}  RefH         .447 .408 .464 .518      .312 .224 .342 .489      .381 .302 .415 .530
             RotH         .472 .428 .490 .553      .314 .223 .346 .497      .393 .307 .435 .559
             AttH         .466 .419 .484 .551      .324 .236 .354 .501      .397 .310 .437 .566

Table 2: Link prediction results for low-dimensional embeddings (d = 32) in the filtered setting. Best score in bold and best published underlined. Hyperbolic isometries significantly outperform Euclidean baselines on WN18RR and YAGO3-10, both of which exhibit hierarchical structures.

Figure 4: WN18RR MRR vs. embedding dimension for d ∈ {10, 16, 20, 32, 50, 200, 500}, comparing MuRP, ComplEx-N3 and RotH. Average and standard deviation computed over 10 runs for RotH.

Baselines  We compare our method to SotA models, including MuRP (Balažević et al., 2019), MuRE (which is the Euclidean analogue of MuRP), RotatE (Sun et al., 2019), ComplEx-N3 (Lacroix et al., 2018) and TuckER (Balazevic et al., 2019). Baseline numbers in high dimensions (Table 5) are taken from the original papers, while baseline numbers in the low-dimensional setting (Table 2) are computed using open-source implementations of each model. In particular, we run hyper-parameter searches over the same parameters as the ones in the original papers to compute baseline numbers in the low-dimensional setting.

Ablations  To analyze the benefits of hyperbolic geometry, we evaluate the performance of AttE, which is equivalent to AttH with curvatures set to zero. Additionally, to better understand the role of attention, we report scores for variants of AttE/H using only rotations (RotE/H) or reflections (RefE/H).

Evaluation metrics  At test time, we use the scoring function in Equation 10 to rank the correct tail or head entity against all possible entities, and use inverse relations for head prediction (Lacroix et al., 2018). Similar to previous work, we compute two ranking-based metrics: (1) mean reciprocal rank (MRR), which measures the mean of inverse ranks assigned to correct entities, and (2) hits at K (H@K, K ∈ {1, 3, 10}), which measures the proportion of correct triples among the top K predicted triples. We follow the standard evaluation protocol in the filtered setting (Bordes et al., 2013): all true triples in the KG are filtered out during evaluation, since predicting a low rank for these triples should not be penalized.

Training procedure and implementation  We train AttH by minimizing the full cross-entropy loss with uniform negative sampling, where negative examples for a triple (h, r, t) are sampled uniformly from all possible triples obtained by perturbing the tail entity:

    L = Σ_{t′ ∼ U(V)} log(1 + exp(y_{t′} s(h, r, t′))),    (11)
    where y_{t′} = −1 if t′ = t, and 1 otherwise.

Since optimization in hyperbolic space is practically challenging, we instead define all parameters in the tangent space at the origin, optimize embeddings using standard Euclidean techniques, and use the exponential map to recover the hyperbolic parameters (Chami et al., 2019). We provide more details on tangent space optimization in Appendix A.4. We conducted a grid search to select the learning rate, optimizer, negative sample size, and batch size, using the validation set to select the best hyperparameters. Our best model hyperparameters are detailed in Appendix A.3. We conducted all our experiments on NVIDIA Tesla P100 GPUs and make our implementation publicly available.∗

∗Code available at https://github.com/tensorflow/neural-structured-learning/tree/master/research/kg_hyp_emb
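The following is a small sketch of the loss in Equation (11) for a single training triple, assuming the positive and negative scores have already been computed with the scoring function above; it is illustrative only.

```python
import numpy as np

def triple_loss(pos_score, neg_scores):
    """Cross-entropy loss of Eq. (11) for one training triple.

    pos_score is s(h, r, t) for the true tail t (label y = -1 in Eq. 11);
    neg_scores holds s(h, r, t') for sampled corrupted tails t' != t (label y = +1).
    """
    loss = np.log1p(np.exp(-pos_score))           # term for the true tail
    loss += np.sum(np.log1p(np.exp(neg_scores)))  # terms for the negative samples
    return loss
```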
    Relation                     KhsG  ξG     RotE  RotH  Improvement
    member meronym               1.00  -2.90  .320  .399  24.7%
    hypernym                     1.00  -2.46  .237  .276  16.5%
    has part                     1.00  -1.43  .291  .346  18.9%
    instance hypernym            1.00  -0.82  .488  .520  6.56%
    member of domain region      1.00  -0.78  .385  .365  -5.19%
    member of domain usage       1.00  -0.74  .458  .438  -4.37%
    synset domain topic of       0.99  -0.69  .425  .447  5.17%
    also see                     0.36  -2.09  .634  .705  11.2%
    derivationally related form  0.07  -3.84  .960  .968  0.83%
    similar to                   0.07  -1.00  1.00  1.00  0.00%
    verb group                   0.07  -0.50  .974  .974  0.00%

Table 3: Comparison of H@10 for WN18RR relations. Higher KhsG and lower ξG means more hierarchical.

    Relation        Anti-symmetric  Symmetric  RotH  RefH  AttH
    hasNeighbor     ✗               ✓          .750  1.00  1.00
    isMarriedTo     ✗               ✓          .941  .941  1.00
    actedIn         ✓               ✗          .145  .110  .150
    hasMusicalRole  ✓               ✗          .431  .375  .458
    directed        ✓               ✗          .500  .450  .567
    graduatedFrom   ✓               ✗          .262  .167  .274
    playsFor        ✓               ✗          .671  .642  .664
    wroteMusicFor   ✓               ✗          .281  .188  .266
    hasCapital      ✓               ✗          .692  .731  .731
    dealsWith       ✗               ✗          .286  .286  .429
    isLocatedIn     ✗               ✗          .404  .399  .420

Table 4: Comparison of geometric transformations on a subset of YAGO3-10 relations.

5.2 Results in low dimensions

We first evaluate our approach in the low-dimensional setting for d = 32, which is approximately one order of magnitude smaller than SotA Euclidean methods. Table 2 compares the performance of AttH to that of other baselines, including the recent hyperbolic (but not rotation-based) MuRP model. In low dimensions, hyperbolic embeddings offer much better representations for hierarchical relations, confirming our hypothesis. AttH improves over previous Euclidean and hyperbolic methods by 0.7% and 6.1% points in MRR on WN18RR and YAGO3-10, respectively. Both datasets have multiple hierarchical relationships, suggesting that the hierarchical structure imposed by hyperbolic geometry leads to better embeddings. On FB15k-237, AttH and MuRP achieve similar performance, both improving over Euclidean baselines. We conjecture that translations are sufficient to model relational patterns in FB15k-237.

To understand the role of dimensionality, we also conduct experiments on WN18RR against SotA methods under varied low-dimensional settings (Figure 4). We include error bars for our method with average MRR and standard deviation computed over 10 runs. Our approach consistently outperforms all baselines, suggesting that hyperbolic embeddings still attain high accuracy across a broad range of dimensions.

Additionally, we measure performance per relation on WN18RR in Table 3 to understand the benefits of hyperbolic geometry on hierarchical relations. We report the Krackhardt hierarchy score (KhsG) (Balažević et al., 2019) and estimated curvature per relation (see Appendix A.2 for more details). We consider a relation to be hierarchical when its corresponding graph is close to tree-like (low curvature, high KhsG). We observe that hyperbolic embeddings offer much better performance on hierarchical relations such as hypernym or has part, while Euclidean and hyperbolic embeddings have similar performance on non-hierarchical relations such as verb group. We also plot the learned curvature per relation versus the embedding dimension in Figure 5b. We note that the learned curvature in low dimensions directly correlates with the estimated graph curvature ξG in Table 3, suggesting that the model with learned curvatures learns more "curved" embedding spaces for tree-like relations.

Finally, we observe that MuRP achieves lower performance than MuRE on YAGO3-10, while AttH improves over AttE by 2.3% in MRR. This suggests that trainable curvature is critical to learn embeddings with the right amount of curvature, while fixed curvature might degrade performance. We elaborate further on this point in Section 5.5.

5.3 Hyperbolic rotations and reflections

In our experiments, we find that rotations work well on WN18RR, which contains multiple hierarchical and anti-symmetric relations, while reflections work better for YAGO3-10 (Table 5). To better understand the mechanisms behind these observations, we analyze two specific patterns: relation symmetry and anti-symmetry. We report per-relation performance on a subset of YAGO3-10 relations in Table 4. We categorize relations into symmetric, anti-symmetric, or neither symmetric nor anti-symmetric categories using data statistics. More concretely, we consider a relation to satisfy a logical pattern when the logical condition is satisfied by most of the triples (e.g., a relation r is symmetric if for most KG triples (h, r, t), (t, r, h) is also in the KG). We observe that reflections encode symmetric relations particularly well, while rotations are well suited for anti-symmetric relations. This confirms our intuition, and the motivation for our approach, that particular geometric properties capture different kinds of logical properties.
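The data-statistic check described above can be sketched as follows; this is an illustrative helper (the exact threshold used to declare a relation symmetric or anti-symmetric is not specified in the text).

```python
def symmetry_statistics(triples):
    """For each relation, compute the fraction of observed triples (h, r, t)
    whose reverse (t, r, h) is also observed.

    A fraction close to 1 suggests a symmetric relation, and a fraction close
    to 0 suggests an anti-symmetric one.
    """
    triple_set = set(triples)
    totals, reciprocal = {}, {}
    for h, r, t in triple_set:
        totals[r] = totals.get(r, 0) + 1
        if (t, r, h) in triple_set:
            reciprocal[r] = reciprocal.get(r, 0) + 1
    return {r: reciprocal.get(r, 0) / totals[r] for r in totals}
```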
(a) MRR for fixed and trainable curvatures on WN18RR.  (b) Curvatures learned with RotH on WN18RR.
Figure 5: (a): RotH offers improved performance in low dimensions; in high dimensions, fixed curvature degrades performance, while trainable curvature approximately recovers Euclidean space. (b): As the dimension increases, the learned curvature of hierarchical relationships tends to zero.

5.4 Attention-based transformations

One advantage of using relation-specific transformations is that each relation can learn the right geometric operators based on the logical properties it has to satisfy. In particular, we observe that in both low- and high-dimensional settings, attention-based models can recover the performance of the best transformation on all datasets (Tables 2 and 5). Additionally, per-relationship results on YAGO3-10 in Table 4 suggest that AttH indeed recovers the best geometric operation.

Furthermore, for relations that are neither symmetric nor anti-symmetric, we find that AttH can outperform rotations and reflections, suggesting that combining multiple operators with attention can learn more expressive operators to model mixed logical patterns. In other words, attention-based transformations alleviate the need to conduct experiments with multiple geometric transformations by simply allowing the model to choose which one is best for a given relation.

5.5 Results in high dimensions

In high dimensions (Table 5), we compare against a variety of other models and achieve new SotA results on WN18RR and YAGO3-10, and third-best results on FB15k-237. As we expected, when the embedding dimension is large, Euclidean and hyperbolic embedding methods perform similarly across all datasets. We explain this behavior by noting that when the dimension is sufficiently large, both Euclidean and hyperbolic spaces have enough capacity to represent complex hierarchies in KGs. This is further supported by Figure 5b, which shows the learned absolute curvature versus the dimension. We observe that curvatures are close to zero in high dimensions, confirming our expectation that RotH with trainable curvatures learns a roughly Euclidean geometry in this setting.

In contrast, fixed curvature degrades performance in high dimensions (Figure 5a), confirming the importance of trainable curvatures and their impact on precision and capacity (previously studied by Sala et al. (2018)). Additionally, we show the distribution of embedding norms in the Appendix (Figure 7). Fixed curvature results in embeddings being clustered near the boundary of the ball, while trainable curvatures adjust the embedding space to better distribute points throughout the ball. Precision issues that might arise with fixed curvature could also explain MuRP's low performance in high dimensions. Trainable curvatures allow RotH to perform as well as or better than previous methods in both low and high dimensions.

5.6 Visualizations

In Figure 6, we visualize the embeddings learned by RotE versus RotH for a sub-tree of the organism entity in WN18RR. To better visualize the hierarchy, we apply k inverse rotations for all nodes at level k in the tree.

By contrast to RotE, RotH preserves the tree structure in the embedding space. Furthermore, we note that RotE cannot simultaneously preserve the tree structure and make non-neighboring nodes far from each other.
    U        Model        WN18RR                   FB15k-237                YAGO3-10
                          MRR  H@1  H@3  H@10      MRR  H@1  H@3  H@10      MRR  H@1  H@3  H@10
    R^d      DistMult     .430 .390 .440 .490      .241 .155 .263 .419      .340 .240 .380 .540
             ConvE        .430 .400 .440 .520      .325 .237 .356 .501      .440 .350 .490 .620
             TuckER       .470 .443 .482 .526      .358 .266 .394 .544      -    -    -    -
             MuRE         .475 .436 .487 .554      .336 .245 .370 .521      .532 .444 .584 .694
    C^d      ComplEx-N3   .480 .435 .495 .572      .357 .264 .392 .547      .569 .498 .609 .701
             RotatE       .476 .428 .492 .571      .338 .241 .375 .533      .495 .402 .550 .670
    H^d      Quaternion   .488 .438 .508 .582      .348 .248 .382 .550      -    -    -    -
    B^{d,1}  MuRP         .481 .440 .495 .566      .335 .243 .367 .518      .354 .249 .400 .567
    R^d      RefE         .473 .430 .485 .561      .351 .256 .390 .541      .577 .503 .621 .712
             RotE         .494 .446 .512 .585      .346 .251 .381 .538      .574 .498 .621 .711
             AttE         .490 .443 .508 .581      .351 .255 .386 .543      .575 .500 .621 .709
    B^{d,c}  RefH         .461 .404 .485 .568      .346 .252 .383 .536      .576 .502 .619 .711
             RotH         .496 .449 .514 .586      .344 .246 .380 .535      .570 .495 .612 .706
             AttH         .486 .443 .499 .573      .348 .252 .384 .540      .568 .493 .612 .702

Table 5: Link prediction results for high-dimensional embeddings (best for d ∈ {200, 400, 500}) in the filtered setting. DistMult, ConvE and ComplEx results are taken from (Dettmers et al., 2018). Best score in bold and best published underlined. AttE and AttH have similar performance in the high-dimensional setting, performing competitively with or better than state-of-the-art methods on WN18RR, FB15k-237 and YAGO3-10.

(a) RotE embeddings.  (b) RotH embeddings.
Figure 6: Visualizations of the embeddings learned by RotE and RotH on a sub-tree of WN18RR for the hypernym relation. In contrast to RotE, RotH preserves hierarchies by learning tree-like embeddings.

For instance, virus should be far from male, but preserving the tree structure (by going one level down in the tree) while making these two nodes far from each other is difficult in Euclidean space. In hyperbolic space, however, we observe that going one level down in the tree is achieved by translating embeddings towards the left. This pattern essentially illustrates the translation component in RotH, allowing the model to simultaneously preserve hierarchies while making non-neighbouring nodes far from each other.

6 Conclusion

We introduce AttH, a hyperbolic KG embedding model that leverages the expressiveness of hyperbolic space and attention-based geometric transformations to learn improved KG representations in low dimensions. AttH learns embeddings with trainable hyperbolic curvatures, allowing it to learn the right geometry for each relationship and generalize across multiple embedding dimensions. AttH achieves new SotA results on WN18RR and YAGO3-10, real-world KGs which exhibit hierarchical structures. Future directions for this work include exploring other tasks that might benefit from hyperbolic geometry, such as hypernym detection. The proposed attention-based transformations can also be extended to other geometric operations.

Acknowledgements

We thank Avner May for their helpful feedback and discussions. We gratefully acknowledge the support of DARPA under Nos. FA86501827865 (SDH) and FA86501827882 (ASED); NIH under No. U54EB020405 (Mobilize); NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N000141712266 (Unifying Weak Supervision); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Swiss Re, the HAI-AWS Cloud Credits for Research program, TOTAL, and members of the Stanford DAWN project: Teradata, Facebook, Google, Ant Financial, NEC, VMWare, and Infosys. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.
References

Ivana Balažević, Carl Allen, and Timothy Hospedales. 2019. Multi-relational Poincaré graph embeddings. In Advances in Neural Information Processing Systems, pages 4465–4475.

Ivana Balazevic, Carl Allen, and Timothy Hospedales. 2019. TuckER: Tensor factorization for knowledge graph completion. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5188–5197.

Trapit Bansal, Da-Cheng Juan, Sujith Ravi, and Andrew McCallum. 2019. A2N: Attending to neighbors for knowledge graph inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4387–4392.

Silvere Bonnabel. 2013. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Ines Chami, Zhitao Ying, Christopher Ré, and Jure Leskovec. 2019. Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems, pages 4869–4880.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. 2018. Hyperbolic neural networks. In Advances in Neural Information Processing Systems.

Albert Gu, Fred Sala, Beliz Gunel, and Christopher Ré. 2019. Learning mixed-curvature representations in product spaces. In International Conference on Learning Representations.

Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 687–696.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference for Learning Representations.

David Krackhardt. 1994. Graph theoretical dimensions of informal organizations. In Computational organization theory, pages 107–130. Psychology Press.

Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. 2018. Canonical tensor decomposition for knowledge base completion. International Conference on Machine Learning.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In Twenty-ninth AAAI Conference on Artificial Intelligence.

Qi Liu, Maximilian Nickel, and Douwe Kiela. 2019. Hyperbolic graph neural networks. In Advances in Neural Information Processing Systems, pages 8228–8239.

Farzaneh Mahdisoltani, Joanna Biega, and Fabian M Suchanek. 2013. Yago3: A knowledge base from multilingual wikipedias.

George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.

Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. 2019. Learning attention-based embeddings for relation prediction in knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Phung. 2018. A novel embedding model for knowledge base completion based on convolutional neural network. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 327–333.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In International Conference on Machine Learning, pages 809–816. Omnipress.

Maximillian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347.

Joel W Robbin and Dietmar A Salamon. Introduction to differential geometry.

Frederic Sala, Chris De Sa, Albert Gu, and Christopher Ré. 2018. Representation tradeoffs for hyperbolic embeddings. In International Conference on Machine Learning, pages 4457–4466.

Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer.
Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. In International Conference on Learning Representations.

Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. 2019. Poincaré GloVe: Hyperbolic word embeddings. In International Conference on Learning Representations.

Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

Canran Xu and Ruijiang Li. 2019. Relation embedding with dihedral group in knowledge graph. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In International Conference on Learning Representations.

Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. 2019. Quaternion knowledge graph embeddings. In Advances in Neural Information Processing Systems, pages 2731–2741.
A Appendix

Below, we provide additional details. We start by providing the formula for the hyperbolic analogue of addition that we use, along with additional hyperbolic geometry background. Next, we provide more information about the metrics that are used to determine how hierarchical a dataset is. Afterwards, we give additional experimental details, including the table of hyperparameters and further details on tangent space optimization. Lastly, we include an additional comparison against the Dihedral model (Xu and Li, 2019).

Figure 7: Histogram of embedding norms learned with fixed and trainable curvatures for the hypernym relation in WN18RR.

A.1 Möbius addition

The Möbius addition operation (Ganea et al., 2018) has the closed-form expression:

    x ⊕^c y = (α_{xy} x + β_{xy} y) / (1 + 2c x^T y + c^2 ||x||^2 ||y||^2),
    where α_{xy} = 1 + 2c x^T y + c||y||^2, and β_{xy} = 1 − c||x||^2.

In contrast to Euclidean addition, it is neither commutative nor associative. However, it provides an analogue through the lens of parallel transport: given two points x, y and a vector v in T_x^c, there is a unique vector in T_y^c which creates the same angle as v with the direction of the geodesic (shortest path) connecting x to y. This map is the parallel transport P^c_{x→y}(·); Euclidean parallel transport is the standard Euclidean addition. Analogously, the Möbius addition satisfies (Ganea et al., 2018): x ⊕^c y = exp^c_x(P^c_{0→x}(log^c_0(y))).

A.2 Hierarchy estimates

We use two metrics to estimate how hierarchical a relation is: the curvature estimate ξG and the Krackhardt hierarchy score KhsG. While the curvature estimate captures global hierarchical behaviours (how much the graph is tree-like when zooming out), the Krackhardt score captures a more local behaviour (how many small loops the graph has). See Figure 8 for examples.

Curvature estimate  To estimate the curvature of a relation r, we restrict to the undirected graph G_r spanned by the edges labeled as r. Following (Gu et al., 2019), let ξ_{G_r}(a, b, c) be the curvature estimate of a triangle in G_r with vertices {a, b, c}, which is given by:

    ξ_{G_r}(a, b, c) = (1 / (2 d_{G_r}(a, m))) ( d_{G_r}(a, m)^2 + d_{G_r}(b, c)^2 / 4 − (d_{G_r}(a, b)^2 + d_{G_r}(a, c)^2) / 2 ),

where m is the midpoint of the shortest path connecting b to c. This estimate is positive for triangles in circles, negative for triangles in trees, and zero for triangles in lines. Moreover, for a triangle in a Riemannian manifold M, ξ_M(a, b, c) estimates the sectional curvature of the plane on which the triangle lies (see (Gu et al., 2019) for more details). Let m_r be the total number of connected components in G_r. We sample 1000 w_{i,r} triangles from each connected component c_{i,r} of G_r, where w_{i,r} = N_{i,r}^3 / Σ_{i=1}^{m_r} N_{i,r}^3 and N_{i,r} is the number of nodes in the component c_{i,r}. ξ_{G_r} is the mean of the estimated curvatures of the sampled triangles. For the full graph, we take the weighted average of the relation curvatures ξ_{G_r} with respect to the weights Σ_{i=1}^{m_r} N_{i,r}^3 / Σ_r Σ_{i=1}^{m_r} N_{i,r}^3.

Krackhardt hierarchy score  For the directed graph G_r spanned by the relation r, we let R be the adjacency matrix (R_{i,j} = 1 if there is an edge from node i to node j, and 0 otherwise). Then:

    Khs_{G_r} = Σ_{i,j=1}^n R_{i,j}(1 − R_{j,i}) / Σ_{i,j=1}^n R_{i,j}.

See (Krackhardt, 1994) for more details. We note that for fully observed symmetric relations (each edge is in a two-edge loop), Khs_{G_r} = 0, while for anti-symmetric relations (no small loops), Khs_{G_r} = 1.
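The two hierarchy estimates above can be sketched as follows, assuming the required graph distances have already been computed; these helpers are illustrative only.

```python
import numpy as np

def krackhardt_score(R):
    """Krackhardt hierarchy score Khs_G of a directed graph, given its 0/1 adjacency matrix R."""
    R = np.asarray(R, dtype=float)
    return float((R * (1 - R.T)).sum() / R.sum())

def triangle_curvature(d_am, d_bc, d_ab, d_ac):
    """Curvature estimate xi_G(a, b, c) of one sampled triangle, given graph distances.

    d_am is the distance from a to the midpoint m of the shortest b-c path;
    the other arguments are the pairwise graph distances between a, b and c.
    """
    return (d_am ** 2 + d_bc ** 2 / 4.0 - (d_ab ** 2 + d_ac ** 2) / 2.0) / (2.0 * d_am)
```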
Figure 8: The curvature estimate ξG and the Krackhardt hierarchy score KhsG for several simple graphs (ξG < 0, ξG = 0 and ξG > 0, each with KhsG = 1 and KhsG = 0). The top-left graph is the most hierarchical, while the bottom-right graph is the least hierarchical.

    Model     WN18RR       FB15k-237    YAGO3-10
              MRR   H@10   MRR   H@10   MRR   H@10
    Dihedral  .486  .557   .300  .496   .388  .573
    AttE      .490  .581   .351  .543   .575  .709

Table 6: Comparison of Dihedral and AttE in high dimensions.

A.3 Experimental details

For all our Euclidean and hyperbolic models, we conduct a hyperparameter search for the learning rate, optimizer (Adam (Kingma and Ba, 2015) or Adagrad (Duchi et al., 2011)), negative sample size and batch size. We train each model for 500 epochs and use early stopping after 100 epochs if the validation MRR stops increasing. We report the best hyperparameters for each dataset in Table 7.

A.4 Tangent space optimization

Optimization in hyperbolic space normally requires Riemannian Stochastic Gradient Descent (RSGD) (Bonnabel, 2013), as was used in MuRP. RSGD is challenging in practice. Instead, we use tangent space optimization (Chami et al., 2019). We define all the AttH parameters in the tangent space at the origin (our parameter space), optimize embeddings using standard Euclidean techniques, and use the exponential map to recover the hyperbolic parameters.

Note that tangent space optimization is an exact procedure, which does not incur losses in representational power. This is the case in hyperbolic space specifically because of a completeness property: there is always a global bijection between the tangent space and the manifold.

Concretely, AttH optimizes the entity and relationship embeddings (e_v^E)_{v∈V} and (r_r^E)_{r∈R}, which are mapped to the Poincaré ball with:

    e_v^H = exp_0^{c_r}(e_v^E) and r_r^H = exp_0^{c_r}(r_r^E).    (12)

The trainable model parameters are then {(Θ_r, Φ_r, r_r^E, a_r, c_r)_{r∈R}, (e_v^E, b_v)_{v∈V}}, which are all Euclidean parameters that can be learned using standard Euclidean optimization techniques.

    Dataset    Dimension  Model  Learning rate  Optimizer  Batch size  Negative samples
    WN18RR     32         RefE   0.001          Adam       100         250
    WN18RR     32         RotE   0.001          Adam       100         250
    WN18RR     32         AttE   0.001          Adam       100         250
    WN18RR     32         RefH   0.0005         Adam       250         250
    WN18RR     32         RotH   0.0005         Adam       500         50
    WN18RR     32         AttH   0.0005         Adam       500         50
    WN18RR     500        RefE   0.1            Adagrad    500         50
    WN18RR     500        RotE   0.001          Adam       100         500
    WN18RR     500        AttE   0.001          Adam       1000        50
    WN18RR     500        RefH   0.05           Adagrad    500         50
    WN18RR     500        RotH   0.001          Adam       1000        50
    WN18RR     500        AttH   0.001          Adam       1000        50
    FB15k-237  32         RefE   0.075          Adagrad    250         250
    FB15k-237  32         RotE   0.05           Adagrad    500         50
    FB15k-237  32         AttE   0.05           Adagrad    500         50
    FB15k-237  32         RefH   0.05           Adagrad    500         250
    FB15k-237  32         RotH   0.1            Adagrad    100         50
    FB15k-237  32         AttH   0.05           Adagrad    500         100
    FB15k-237  500        RefE   0.05           Adagrad    500         50
    FB15k-237  500        RotE   0.05           Adagrad    100         50
    FB15k-237  500        AttE   0.05           Adagrad    500         50
    FB15k-237  500        RefH   0.05           Adagrad    500         50
    FB15k-237  500        RotH   0.05           Adagrad    1000        50
    FB15k-237  500        AttH   0.05           Adagrad    500         50
    YAGO3-10   32         RefE   0.005          Adam       2000        NA
    YAGO3-10   32         RotE   0.005          Adam       2000        NA
    YAGO3-10   32         AttE   0.005          Adam       2000        NA
    YAGO3-10   32         RefH   0.005          Adam       1000        NA
    YAGO3-10   32         RotH   0.001          Adam       1000        NA
    YAGO3-10   32         AttH   0.001          Adam       1000        NA
    YAGO3-10   500        RefE   0.005          Adam       4000        NA
    YAGO3-10   500        RotE   0.005          Adam       4000        NA
    YAGO3-10   500        AttE   0.005          Adam       2000        NA
    YAGO3-10   500        RefH   0.001          Adam       1000        NA
    YAGO3-10   500        RotH   0.0005         Adam       1000        NA
    YAGO3-10   500        AttH   0.0005         Adam       1000        NA

Table 7: Best hyperparameters in low- and high-dimensional settings. NA negative samples indicates that the full cross-entropy loss is used, without negative sampling.
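As a small illustration of the tangent-space parameterization in Equation (12), the following sketch reuses the helpers from the earlier code sketches; it is an assumption-laden illustration, not the released training code.

```python
def score_from_tangent_parameters(e_h_tan, e_t_tan, r_tan, thetas, phis, a, c, b_h, b_t):
    """Evaluate the AttH score when embeddings are stored as tangent-space vectors.

    Following Eq. (12), the Euclidean parameters are first pushed onto the Poincare
    ball with the exponential map, and the score of Eq. (10) is then computed there.
    """
    e_h = expmap0(e_h_tan, c)
    e_t = expmap0(e_t_tan, c)
    r_h = expmap0(r_tan, c)
    return atth_score(e_h, e_t, thetas, phis, r_h, a, c, b_h, b_t)
```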

A.5 Comparison to Dihedral


We compare the performance of Dihedral (Xu and Li, 2019) versus that of AttE in Table 6. Both methods combine rotations and reflections, but our approach learns attention-based transformations, while Dihedral learns a single parameter to determine which transformation to use. AttE significantly outperforms Dihedral on all datasets, suggesting that using attention-based representations is important in order to learn the right geometric transformation for each relation.
