Building a Large-scale, Accurate and Fresh Knowledge Graph

Source: Deep Learning on Medium

Microsoft gave a wonderful tutorial on knowledge graphs at KDD 2018. If you are a machine learning engineer or an NLP engineer, I highly recommend reading it. It covers what a knowledge graph (KG) is, the challenges of constructing a KG at large scale, and approaches to those challenges, with paper references.

This post is a summary of the tutorial. You can find the slides here.

Part I: Introduction

There are several measures for evaluating knowledge quality: correctness, coverage, freshness, and usage.

Ensuring correctness, coverage, and freshness for a vast KG is a huge challenge. A very common problem is that multiple entities share the same name, e.g. Will Smith. Linking a piece of information to the correct Will Smith is a challenge of its own (Entity Linking, EL).
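To make the entity linking problem concrete, here is a minimal sketch that picks between two candidates named Will Smith by counting the words their descriptions share with the mention's context. The candidate records, their ids, and the word-overlap scoring are my own illustrative assumptions, not the method from the tutorial.

```python
import re

# Hypothetical candidate records that share the name "Will Smith".
# The ids and descriptions are invented for illustration.
CANDIDATES = {
    "will_smith_actor": "american actor rapper film producer",
    "will_smith_cricketer": "english cricketer batsman county",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link_entity(context, candidates):
    """Return the candidate id whose description overlaps most with the context."""
    context_words = tokenize(context)
    return max(
        candidates,
        key=lambda cid: len(context_words & tokenize(candidates[cid])),
    )


print(link_entity("Will Smith starred in the film as an actor", CANDIDATES))
print(link_entity("Will Smith, the county cricketer, scored a century", CANDIDATES))
```

Real systems use much richer signals than word overlap, but the shape of the problem is the same: one mention, several candidates, and a score that decides between them.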

Converting raw data into a high-quality KG mainly involves three steps: extracting data from structured or unstructured data sources, using a schema to correlate data and relationships, and conflating the schematized knowledge.

The above figure shows the active research and product efforts related to KG. KG has many research directions. If you want to start learning about KGs, I recommend starting from Knowledge Graph Construction, which includes some common NLP techniques: Named Entity Recognition (NER), Relation Extraction (RE), and End-to-End Relation Extraction. The goal of these techniques is to produce triples. For example, (Will Smith, profession, Actor) is an entity-attribute-value triple, and (Will Smith, couple, Jada Pinkett Smith) is an entity-relation-entity triple.
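As a small illustration of what triple data looks like in code, the sketch below stores the two example triples and groups them by subject. The `Triple` type and the `build_index` helper are names I made up for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact in the graph: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


# The two example triples from the text.
triples = [
    Triple("Will Smith", "profession", "Actor"),            # entity-attribute-value
    Triple("Will Smith", "couple", "Jada Pinkett Smith"),   # entity-relation-entity
]


def build_index(items):
    """Group triples by subject so all facts about an entity sit together."""
    index = defaultdict(list)
    for t in items:
        index[t.subject].append((t.predicate, t.obj))
    return dict(index)


graph = build_index(triples)
print(graph["Will Smith"])
```

Both kinds of triple have the same shape; the only difference is whether the object is a plain value or another entity in the graph.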

Part II: Acquiring Knowledge in the Wild

This part lists a lot of papers. I only cover some of them here. If you find anything you are interested in, I recommend reading the tutorial directly.

We can extract knowledge from numerous data sources, which mainly fall into two kinds: structured sources and unstructured sources. The number of structured sources is limited, so we also need to extract knowledge from unstructured sources with NLP techniques such as NER and RE.

In Part II, the slides list many papers on extracting knowledge from different sources: from the web (rule-based, tree-based, and machine-learning-based methods), from news and forums, from email and calendars, and from social media.

To increase coverage, the slides also list papers on NER, Relation Extraction, Entity Linking, and knowledge base (KB) embedding for KB completion.

There is also some work on verifying knowledge.

Beyond the above, this part also covers papers on human intervention, related to distant supervision (DS) and crowdsourcing.

Part III: Building Knowledge Graph

This part introduces how Microsoft builds the Satori Graph. The whole process has four main phases.

Phase 1: Data Ingestion

Data ingestion includes parsing and standardizing the data so that it is stored in a uniform manner, and mapping the extracted data to the Microsoft Ontology.
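A rough sketch of what standardization and ontology mapping might look like is shown below. The field names and the `person.*` property names are hypothetical; I do not know the actual Microsoft Ontology schema, so treat this only as an illustration of the idea.

```python
# Hypothetical mapping from source-specific field names to ontology properties.
# These property names are invented and are not the real Microsoft Ontology.
FIELD_TO_ONTOLOGY = {
    "name": "person.name",
    "full_name": "person.name",
    "dob": "person.date_of_birth",
    "birth_date": "person.date_of_birth",
}


def standardize(record):
    """Normalize keys and whitespace, then map known fields to ontology properties."""
    out = {}
    for key, value in record.items():
        prop = FIELD_TO_ONTOLOGY.get(key.strip().lower())
        if prop is None:
            continue  # unmapped fields are simply dropped in this sketch
        out[prop] = " ".join(str(value).split())  # collapse repeated whitespace
    return out


raw = {"Full_Name": "  Will   Smith ", "DOB": "1968-09-25", "extra": 1}
print(standardize(raw))
```

Once every source goes through a step like this, records from different sources use the same property names and can be compared in the next phase.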

Phase 2: Match & Merge

This graph shows the ingestion flow in the second phase, Match & Merge.

The biggest problems in the ingestion flow are entity matching, that is, identifying and discovering instances that refer to the same real-world entity, and data quality: missing data may be caused by the information extraction technology or by human error.

To detect matching entities, the authors introduce several approaches. You can find the recommended paper for each approach in the slides.
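To give a feel for entity matching, here is a toy matcher based on string similarity from Python's standard `difflib` module. The record layout, the `birth_year` check, and the 0.85 threshold are assumptions I chose for the example; the approaches in the slides are more sophisticated.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(rec_a, rec_b, threshold=0.85):
    """Match if names are similar and known birth years do not conflict."""
    year_a, year_b = rec_a.get("birth_year"), rec_b.get("birth_year")
    if year_a and year_b and year_a != year_b:
        return False  # a conflicting attribute rules out the match
    return name_similarity(rec_a["name"], rec_b["name"]) >= threshold


a = {"name": "Will Smith", "birth_year": 1968}
b = {"name": "will  smith", "birth_year": 1968}
c = {"name": "Will Smith", "birth_year": 1982}
print(is_match(a, b), is_match(a, c))
```

The third record shows why name similarity alone is not enough: two records with identical names can still describe different people, which is exactly the Will Smith problem from Part I.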

Phase 3: Knowledge Refinement

This phase eliminates the conflicting facts and connections that remain after phase 2. It mainly consists of error detection and fact inference.
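One simple form of error detection is to flag attributes that should have a single value but end up with several after merging. The sketch below assumes a hypothetical set of single-valued predicates and made-up example facts; it is not the refinement logic from the tutorial.

```python
from collections import defaultdict

# Assumption: these predicates admit only one value per entity.
FUNCTIONAL_PREDICATES = {"date_of_birth"}


def find_conflicts(triples):
    """Return {(subject, predicate): sorted values} where a functional
    predicate has more than one distinct value."""
    values = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in FUNCTIONAL_PREDICATES:
            values[(subject, predicate)].add(obj)
    return {key: sorted(vals) for key, vals in values.items() if len(vals) > 1}


# Made-up merged facts; the second birth date is a deliberate error.
merged = [
    ("Will Smith", "date_of_birth", "1968-09-25"),
    ("Will Smith", "date_of_birth", "1968-09-26"),
    ("Will Smith", "profession", "Actor"),
    ("Will Smith", "profession", "Rapper"),
]
print(find_conflicts(merged))
```

Note that two professions are not flagged, because `profession` is not in the single-valued set. Deciding which of the conflicting values to keep is a separate step that this sketch does not attempt.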

Phase 4: Publishing & Serve

This phase is covered separately in Part IV below.

Part IV: Serving Knowledge to the World

This part mainly talks about how to serve a KG for question answering (QA).

There are three main challenges: different languages, a large search space, and compositionality.

There are two main approaches: the semantic parsing approach and the information extraction approach.
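As a very rough illustration of the semantic parsing idea, the sketch below turns a question into a structured (subject, predicate) query and runs it against a list of triples. The regular-expression templates are a toy stand-in I wrote for this post; real semantic parsers are learned models, and this example handles neither multiple languages nor compositional questions.

```python
import re

# Example triples reused from Part I.
TRIPLES = [
    ("Will Smith", "profession", "Actor"),
    ("Will Smith", "couple", "Jada Pinkett Smith"),
]

# Hypothetical question templates, each mapped to a predicate.
TEMPLATES = [
    (re.compile(r"^what is the profession of (?P<entity>.+?)\??$", re.I), "profession"),
    (re.compile(r"^who is the partner of (?P<entity>.+?)\??$", re.I), "couple"),
]


def parse(question):
    """Map a question to a (subject, predicate) query, or None if no template fits."""
    for pattern, predicate in TEMPLATES:
        match = pattern.match(question.strip())
        if match:
            return match.group("entity"), predicate
    return None


def answer(question, triples=TRIPLES):
    """Parse the question, then look up matching objects in the triples."""
    query = parse(question)
    if query is None:
        return []
    subject, predicate = query
    return [o for s, p, o in triples if s.lower() == subject.lower() and p == predicate]


print(answer("What is the profession of Will Smith?"))
print(answer("Who is the partner of Will Smith?"))
```

Any question outside the two templates returns an empty list, which hints at why a large search space and compositionality are listed as challenges.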
