Original article was published on Artificial Intelligence on Medium
KG Embeddings: Hyperbolic and Hyper-relational
Hyperbolic spaces are among recent hot topics in ML. Refraining from 🤯, in simpler terms, in a hyperbolic space 🔮 (thanks to its properties) you can represent hierarchies and tree-like structures more efficiently and at the same time use fewer dimensions!
With that motivation, Chami et al propose AttH, a hyperbolic KG embedding algorithm that uses rotations, reflections, and translations to model logical and hierarchical patterns in the KG. Att comes from the hyperbolic attention that is applied to rotated and reflected vectors. 🎩The trick to bypass shaky Riemannian optimization is to use tangent spaces, to which every point of the d-dimensional Poincare ball can be mapped. In this obviously non-trivial setup, each relation is associated not only to one vector, but also to the parameters that describe relation-specific reflections and rotations. Still, in real-world KGs
R << V , so the overhead is not that big.
⚖️ In the experiments, AttH performs especially well on WN18RR and Yago 3–10 that exhibit some hierarchical structure, with lesser margins on FB15k-237. More importantly, just 32-dimensional AttH shows huge margins compared to 32-d models in real and complex planes. Moreover, 32-d scores are just 0.02–0.03 MRR points smaller than SOTA 500-d embedding models on WN18RR and FB15k-237. Ablation studies demonstrate the importance of having learnable curvatures whereas its closest match, MurP, has them fixed.
Another growing trend in graph representation learning is to go beyond simple KGs consisting of triples and learn representations for more complex, hyper-relational KGs (as coined in the work by Rosso et al), when every triple might have a set of key-value attribute pairs that give fine-grained details on the validity of the triple in various contexts. In fact, Wikidata adopts the hyper-relational model in its Wikidata Statement Model where attributes are called qualifiers. It’s important not to mix the model with n-ary facts (that generate redundant predicates) and hypergraphs. That is, if you work with Wikidata only on the triple level, you lose a good half of the content 😃
The idea of NeuInfer is to compute a validity and compatibility score of a hyper-relational fact (cf the illustration). First,
(h,r,t) embeddings are fed into a fully-connected net (FCN) to estimate the likelihood of this triple (validity). Second, for each key-value pair a quintuple
(h,r,t,k,v) is constructed and passed through another set of FCNs. Having m pairs, m vectors are min-pooled and the result represents the compatibility score, i.e., how well those qualifiers live with the main triple. Finally, the authors use a weighted sum of two scores to get the final score.
The authors evaluate NeuInfer on standard benchmarks JF17K (extracted from Freebase) and WikiPeople (from Wikidata) and report significant improvement in JF17K compared to NaLP when predicting heads, tails, and attribute values. 📈 I would encourage the authors to compare their numbers with HINGE (from Rosso et al) as both approaches are conceptually similar.
💡 And now we need to talk. We need to talk about reproducibility of KG embedding algorithms published even at top conferences like ACL 2019. Sun, Vashishth, Sanyal et al find that several recent KGE models that reported SOTA results (drastically better than existing baselines) suffer from test set leakages, or have unusually many zeroified neurons after ReLU activations scoring valid triples ☢️. Further, they show that their performance metric scores (like Hits@K and MRR) depend on the position of the valid triple among sampled negatives (which should not happen, actually). On the other hand, existing strong baselines perform exactly the same despite any position. The take-away message is to use the evaluation protocol that places a valid triple at a random position among negatives.
[start of a shameless self-promotion 😊] Well, our team also has something to say about this issue: in our new paper “Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework” we performed 65K+ experiments and spent 21K+ GPU hours evaluating 19 models spanning from RESCAL first published in 2011 to the late 2019’s RotatE and TuckER, 5 loss functions, various training strategies with/without negative sampling, and many more hyper-parameters that turn out to be important to consider. We are also releasing the best found hyperparameters for all the models for you folks and our beloved community 🤗. In addition, we are releasing PyKEEN 1.0, a PyTorch library for training and benchmarking KG embeddings models! [end of the self-promotion]
🔥 Several other works I’d encourage you to read thoroughly: Sachan studies the problem of compression of KG entity embeddings by discretization, e.g., “Barack Obama” instead of a 200d float32 vector would be encoded as “2–1–3–3” and “Michelle Obama” as “2–1–3–2”.
That is, you only need a D-long vector of K values (here, D=4, K=3). For discretization, tempered softmax is found to work better. And as a reverse function from the KD code back to N-dimensional vector of floats the author suggests using a simple Bi-LSTM. Experimental results are astonishing 👀 — compression rates for FB15k-237 and WN18RR reach 100–1000x with a negligible (max 2% MRR) performance drop and computation overhead at inference time (when a KD code has to be decoded back). 🤔 I suggest we all sit down for a minute and re-think our KGE pipelines (especially in production scenarios). For example, 200d embeddings of 78M Wikidata entities obtained via PyTorch-BigGraph require 👉 110 GB 👈 of space. Just imagine what would be possible with a gentle 100x compression 😏.
➕ There is also a lineup of works that improve popular KGE models:
* Tang et al generalize RotatE from 2D rotations to high-dimensional spaces with orthogonal relation transforms which works better for 1-N and N-N relations.
* Xu et al generalize bilinear models to multi-linear by chunking dense vectors in K parts. It is then shown that if K=1 the approach is equal to DistMult, if K=2 the approach reduces to ComplEx and HolE, and the authors experiment with K=4 and K=8.
* Xie et al extend ConvE by replacing standard conv filters with those from the Inception network famous in the Computer Vision domain.
* Nguyen et al apply a self-attention style encoder and a CNN decoder for triple classification and search personalization tasks.