Knowledge Base Construction

Original article can be found here (source): Artificial Intelligence on Medium


What is KBC?

Knowledge base construction (KBC) is the process of populating a knowledge base by extracting information from unstructured sources such as PDFs, text and images. Unstructured data such as documents is often difficult to work with directly. KBC systems store the extracted information in a structured format that is easy to use, so applications can work with data that was previously out of their reach.

For example, suppose you want to build a chatbot that answers questions about the Harry Potter series, or a quiz bot for kids who have read it. The earlier approach to building such bots would have been to write out an extensive list of questions and answers by hand. KBC makes such software much easier to create: one first builds a knowledge base for Harry Potter, and once it is built, generating FAQs or quizzes becomes ordinary software development. As you read this article you will see the potential of KBC to transform the way such applications are created. This is Artificial Intelligence in overdrive!

We have already seen in the previous article, “Helplessness of Software”, why it is difficult for software to deal with unstructured data such as text, images and speech. KBC systems help by providing such software with a knowledge base that can be accessed without any additional resources. KBC systems of the future will be available as APIs for other software to use.

General Architecture:

Let us now focus on the general architecture of a KBC system. Implementations may vary across companies working in this domain, but the general principles hold across all of them. Building a KBC system is difficult because it has to combine multiple information extraction processes. Each of these processes is a system in itself, so building an entire KBC system requires a collaborative effort.

General Architecture of a KBC system

A typical KBC system consists of the following layers:

Layer 1: This layer consists of all the elements that help in extracting information from the data lake. The data lake contains all the dark data, including unstructured text, images, videos, etc.

  • Deep Learning / Machine Learning: Both are used to classify objects into categories and thus find meaning in data. Before Deep Learning, it was difficult to extract information from images; it now makes tasks such as object detection practical.
  • Natural Language Processing (NLP): NLP is used to process and analyse natural language. It is a mature field that offers many tools for dealing with text. Tasks such as lemmatization, stemming, part-of-speech tagging, parsing, named entity recognition and relationship extraction help make sense of unstructured text. Combining NLP with Deep Learning is improving the accuracy of most of these tasks.
  • Graph Database: A graph database stores data as nodes and edges, where the edges store the relationships between nodes. This is a natural way to store data and helps in writing more natural queries.
  • First Order Logic (FOL): FOL allows a system to reason about data the way humans do. Because FOL offers a formal way to represent sentences, it lets the system check whether they are valid. KBC systems try to devise probabilistic versions of FOL to support probabilistic reasoning.
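
To make the nodes-and-edges idea concrete, here is a minimal in-memory sketch in Python. It is not how any particular graph database works; the TinyGraph class and its method names are assumptions made up for this illustration, and a real system would use a dedicated graph store.

```python
from collections import defaultdict


class TinyGraph:
    """Hypothetical in-memory graph: nodes are strings, edges carry a relation label."""

    def __init__(self):
        # Maps a source node to a list of (relation, target) pairs.
        self._out = defaultdict(list)

    def add_edge(self, source, relation, target):
        """Store a labelled, directed edge from source to target."""
        self._out[source].append((relation, target))

    def neighbors(self, source, relation):
        """Return all targets reachable from source over edges with this relation."""
        return [t for r, t in self._out.get(source, []) if r == relation]


graph = TinyGraph()
graph.add_edge("Harry Potter", "best_friends", "Ronald Weasley")
graph.add_edge("Harry Potter", "isa", "Wizard")
graph.add_edge("Ronald Weasley", "isa", "Wizard")

# "Who are Harry Potter's best friends?" becomes a lookup along labelled edges.
friends = graph.neighbors("Harry Potter", "best_friends")
print(friends)
```

The query reads much like the question it answers, which is the sense in which graph storage helps in writing more natural queries.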

Layer 2: This layer contains the elements needed to perform reasoning over the data stored in the graph database and in FOL.

  • Knowledge Representation: Facts are generally stored in a KBC system as triples (subject-predicate-object). One of the early efforts in KBC is the Never Ending Language Learner (NELL), a continuous learning system and a well-known example of collaborative learning used to extract facts about the world (e.g., playsInstrument(George_Harrison, guitar)). The stored facts should be probabilistic. Like most AI systems, a KBC system works with probabilities rather than certainties: it gathers evidence from different sources to check that a fact is accurate, and different sources may give different evidence. For example, when the system first gathers the fact that Harry Potter is a wizard, it assigns it a low probability, and it keeps increasing that probability as it finds new evidence. So the system will store the relationship isa(“Harry Potter”, “Wizard”, 0.65): as of now it can say with probability 0.65 that Harry Potter is a wizard. Another fact can be best_friends(“Harry Potter”, “Ronald Weasley”, 0.85).

Just imagine the power of storing such facts. Another element of the KBC system can then reason over them. For example, given that Harry Potter is a wizard and that Harry and Ronald Weasley are best friends with a very high probability, what is the probability that Ronald Weasley is also a wizard?
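
As a rough sketch of how such probabilistic facts might be stored and reasoned over, the Python example below keeps one probability per triple and raises it as evidence arrives. The article does not say how evidence is combined, so the noisy-OR rule used here is an assumption, as are the FactStore class, the evidence strengths and the RULE_WEIGHT of 0.9 for the made-up rule "best friends of a wizard tend to be wizards". Real systems use richer probabilistic logic than this.

```python
class FactStore:
    """Hypothetical store mapping (predicate, subject, object) to a probability."""

    def __init__(self):
        self._facts = {}

    def add_evidence(self, predicate, subject, obj, strength):
        """Combine new evidence with the current belief using noisy-OR.

        Assumption: pieces of evidence are independent, so the fact is only
        false if every piece of evidence independently fails to support it.
        """
        key = (predicate, subject, obj)
        prior = self._facts.get(key, 0.0)
        self._facts[key] = 1.0 - (1.0 - prior) * (1.0 - strength)
        return self._facts[key]

    def probability(self, predicate, subject, obj):
        """Return the stored probability, or 0.0 for an unknown fact."""
        return self._facts.get((predicate, subject, obj), 0.0)


store = FactStore()

# Two independent sources support the same fact; the belief grows each time.
store.add_evidence("isa", "Harry Potter", "Wizard", 0.5)
p_wizard = store.add_evidence("isa", "Harry Potter", "Wizard", 0.3)

store.add_evidence("best_friends", "Harry Potter", "Ronald Weasley", 0.85)
p_friends = store.probability("best_friends", "Harry Potter", "Ronald Weasley")

# Invented rule weight: how strongly friendship with a wizard suggests wizardhood.
RULE_WEIGHT = 0.9
p_ron = store.add_evidence(
    "isa", "Ronald Weasley", "Wizard", p_wizard * p_friends * RULE_WEIGHT
)

print(round(p_wizard, 2), round(p_ron, 2))
```

With these invented strengths, the two pieces of evidence combine to about 0.65 for Harry, and the rule yields roughly 0.50 for Ronald. Multiplying the probabilities treats the two premises as independent, which is a simplification chosen to keep the sketch short.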

  • Knowledge Query: KBC systems should offer a means to query the knowledge. A knowledge query performs probabilistic inference to find the necessary information, which can then be used in many ways, such as performing analytics or making a decision.
  • Knowledge Learning: Any KBC system should learn incrementally. It should start from basic knowledge and build on it as it learns from new data. Since the knowledge is probabilistic, the system should continuously update and correct it: it should not just learn new facts but also revalidate old facts against new knowledge. For example, if our Harry Potter KBC system has derived certain facts from reading the first novel, it should update them as it reads the later novels.
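
The simplest form of a knowledge query can be sketched as pattern matching over stored triples with a probability threshold. This is only a filter over already stored probabilities, not full probabilistic inference; the Fact type, the query function, and the 0.40 fact about Ronald Weasley are assumptions invented for the example.

```python
from typing import List, NamedTuple, Optional


class Fact(NamedTuple):
    predicate: str
    subject: str
    obj: str
    probability: float


FACTS = [
    Fact("isa", "Harry Potter", "Wizard", 0.65),
    Fact("best_friends", "Harry Potter", "Ronald Weasley", 0.85),
    Fact("isa", "Ronald Weasley", "Wizard", 0.40),  # invented for illustration
]


def query(
    facts: List[Fact],
    predicate: Optional[str] = None,
    subject: Optional[str] = None,
    obj: Optional[str] = None,
    min_probability: float = 0.0,
) -> List[Fact]:
    """Return matching facts, most probable first. None acts as a wildcard."""
    matches = [
        f
        for f in facts
        if (predicate is None or f.predicate == predicate)
        and (subject is None or f.subject == subject)
        and (obj is None or f.obj == obj)
        and f.probability >= min_probability
    ]
    return sorted(matches, key=lambda f: f.probability, reverse=True)


# "Who is a wizard, with at least 0.5 probability?"
likely_wizards = [
    f.subject for f in query(FACTS, predicate="isa", obj="Wizard", min_probability=0.5)
]
print(likely_wizards)
```

The threshold is where the probabilistic nature of the knowledge shows up: as incremental learning raises or lowers a fact's probability, the same query can start or stop returning it.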

Layer 3: This is the application layer of a KBC system. All kinds of applications that interact with layer 2 can be built here. A few of the applications that can harness KBC are as follows:

  • Chatbots: Once KBC converts unstructured data to structured data, it becomes easy to design chatbots that can harness the knowledge and answer questions the way humans do. Just imagine a chatbot that can answer all your questions about Harry Potter. It will feel like talking to a friend about the series.
  • Expert Systems: Before KBC, expert systems suffered from not being able to process dark data. Expert systems that take advantage of KBC will be very useful in fields such as healthcare, where there is a lot of dark data.
  • Search: It becomes easy to find information when it is structured, and reasoning over a knowledge base makes it possible to find the correct information. Imagine you search for “Who is the most knowledgeable student in Harry Potter?”. If this fact is never stated directly in the series, how will a search engine find an answer that is only presented indirectly? KBC systems can reason over such facts and thus help in searching.
  • Diagnosis: Help desks of IT and hardware companies will be able to answer customer queries and diagnose problems better once they start using KBC. Imagine the wealth of past customer queries available to reason over. Call centre employees will be able to harness that past information easily to solve new queries.
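
To illustrate how thin an application in this layer can become once facts are structured, the following Python sketch turns stored triples into quiz questions. The question templates, the make_quiz function, the 0.6 threshold and the low-confidence fact about Hedwig are all hypothetical choices for this example, not part of any existing KBC system.

```python
# Hypothetical question templates, one per predicate.
TEMPLATES = {
    "isa": "What is {subject}?",
    "best_friends": "Who is the best friend of {subject}?",
}

# Each fact is (predicate, subject, object, probability).
FACTS = [
    ("isa", "Harry Potter", "Wizard", 0.65),
    ("best_friends", "Harry Potter", "Ronald Weasley", 0.85),
    ("isa", "Hedwig", "Owl", 0.30),  # invented low-confidence fact
]


def make_quiz(facts, min_probability=0.6):
    """Build (question, answer) pairs from facts the system is confident about."""
    quiz = []
    for predicate, subject, obj, probability in facts:
        template = TEMPLATES.get(predicate)
        # Skip predicates without a template and facts below the threshold.
        if template is None or probability < min_probability:
            continue
        quiz.append((template.format(subject=subject), obj))
    return quiz


quiz = make_quiz(FACTS)
for question, answer in quiz:
    print(f"Q: {question}  A: {answer}")
```

Only facts above the confidence threshold make it into the quiz, so the application benefits directly from the probabilities maintained in layer 2.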

Each of the elements mentioned above deserves an article of its own. In future blog posts we will go into detail on how each element works.