Re-evaluation of Knowledge Graph Completion Methods

Original article can be found here (source): Artificial Intelligence on Medium

Re-evaluation of Knowledge Graph Completion Methods

Despite their large sizes, knowledge graphs are far from complete. Knowledge graph completion methods that aim to add missing facts to a knowledge graph have become a hot topic. Among them, knowledge graph representation learning models or embedding models are extensively studied with new embedding models being proposed rapidly. We did an experimental study to evaluate these methods and our paper will appear in SIGMOD 2020. Briefly speaking, these are the results we obtained:

  • The data redundancy and test leakage in widely-used benchmark datasets caused an overestimation of the accuracy in the range of 19% -175% for many models.
  • We identified Cartesian product relations which also lead to unrealistic evaluation performance.
  • Many test cases (e.g., 70% in one dataset) used to evaluate the models are unrealistic and nonexistent in real-world scenarios.

Knowledge Graphs

Knowledge graphs are rapidly growing in popularity. They store real-world facts in the form of triples. For example, the fact that Bill Gates founded Microsoft will be stated as (Bill Gates, founded, Microsoft). Companies and corporations such as Google, Microsoft, LinkedIn, Siemens, Thomson Reuters, etc. are exploiting knowledge graphs to manage and easily navigate highly connected, multi-relational data.

Gartner hype cycle for emerging technologies of 2019 has also highlighted knowledge graphs as one of the emerging technologies with a significant impact on business, society, and people over the next 5 to 10 years. A brief look at the big conferences such as ACL 2019, NeurIPS 2019, and AAAI 2020 also shows the growth in knowledge graph-related researches.

What’s our paper about?

Our paper investigates the true effectiveness of knowledge graph completion methods and the defects that exist in extensively-used benchmark datasets FB15k (a subset of Freebase) and WN18 (extracted from WordNet) as well as the recent dataset YAGO3–10 (a subset of YAGO3). These datasets that have been used to train and evaluate many of the embedding models contain a large number of reverse and duplicate triples. Our paper shows how data redundancy and test leakage existing in these datasets affect the embedding models. Another problem that we looked at is the existence of Cartesian product relations in FB15k. I will talk about these relations at the end of this post.

The bottom line is that all of the mentioned problems that exist with the datasets cause an unrealistic inflate to models’ accuracy. Moreover, training a knowledge graph completion model using these datasets is a form of overfitting and the learned model is optimized for the aforementioned triples which cannot be generalized to realistic settings.

Embedding models and the datasets used to train them

Embedding models learn the multi-dimensional representations h, r, and t of a triple (head entity (subject), relation, tail entity(object)) in a knowledge graph. Triples will be denoted as(h,r,t) in the rest of this post.

As we know, datasets play an important role in training a machine learning model. In the case of knowledge graph embedding models, the datasets that were widely used to train and test them have different problems. As a result, we have models that won’t be effective when used to do real-world knowledge graph completion.

FB15k & FB15k-237

FB15k contains many reverse triples. It includes many pairs of (h,r,t) and (h,r⁻¹,t) where r and r⁻¹ are reverse relations:

(avatar, film/directed_by, James Cameron)

(James Cameron, director/film, Avatar)

Freebase actually denotes reverse relations explicitly using a special relation reverse_property:

(film/directed_by, reverse_property, director/film)

About 70% of triples in the training set of FB15k form reverse pairs and, also for 70% of triples in the test set of FB15k, their reverse triples exist in the training set.

These data characteristics suggest that embedding models would have been biased toward learning reverse relations for link prediction. More specifically, the task can largely become inferring whether two relations r₁ and r₂ form a reverse pair. Given the abundant reverse triples in the dataset, this goal could potentially be achieved without using a machine learning approach based on complex embeddings of entities and relations. Instead, we can derive simple rules of the form (h,r₁,t)⇒ (t,r₂,h) using statistics about the triples in the dataset. In fact, we generated such a simple model that attained 71.6% for FB15k using FHits@1↑, a common accuracy measure for embedding models (the higher the better). Based on our results, the best performing embedding model has a FHits@1↑ of 73.8% on FB15K.

link prediction scenario, given such data, is non-existent in the real-world.

It’s important to notice that link prediction scenario, given such data, is non-existent in the real-world at all. With regard to FB15k, the redundant reverse relations, coming from Freebase, were just artificially created. New facts were always added to Freebase as a pair of reverse triples, denoted explicitly by the relation reverse_property. For such intrinsically reverse relations that always come in pair, we never need to predict a triple while its reverse is already in the knowledge graph. Training a knowledge graph completion model using FB15k is thus a form of overfitting in that the learned model is optimized for the reverse triples which cannot be generalized to realistic settings.

Toutanova and Chen noted the aforementioned problem with FB15k and created FB1k-237 by removing such redundancy. To investigate the effect of redundant data available in FB15k, we did some experiments to compare the results of several embedding models on FB15k vs. FB15k-237 and the following table displays the results for some of the methods, using different popular metrics. It is worth mentioning that by definition, higher Hits@1↑ (FHits@1↑), Hits@10↑ (FHits@10↑) and MRR↑ (FMRR↑), and lower MR↓ (FMR↓) indicate better accuracy.