spaCy

Source: Deep Learning on Medium

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text.

Installation:

$ pip install -U spacy

Download statistical models

Predict part-of-speech tags, dependency labels, named entities and more. See the spaCy documentation for the available models.

$ python -m spacy download en_core_web_sm

Check that your installed models are up to date

$ python -m spacy validate

Features:

NER: Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values and percentages.

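As a quick, minimal sketch of NER in spaCy (reusing the sample sentence from the matching example further below; the exact entities and labels depend on the model), the doc.ents attribute exposes the predicted entity spans:

import spacy

nlp = spacy.load("en_core_web_sm")

# sample sentence (same as in the rule-based matching example below)
doc = nlp("GDP in developing countries such as Vietnam will continue growing at a high rate.")

# doc.ents holds the named-entity spans predicted by the model
for ent in doc.ents:
    print(ent.text, "-->", ent.label_)

With en_core_web_sm, "Vietnam" is typically labelled GPE (geopolitical entity).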

Different Approaches to Information Extraction

In the previous section, we managed to easily extract triples from a few sentences. However, in the real world, the data size is huge and manual extraction of structured information is not feasible. Therefore, automating this information extraction becomes important.

There are multiple approaches to perform information extraction automatically. Let’s understand them one-by-one:

  1. Rule-based Approach: We define a set of rules for the syntax and other grammatical properties of a natural language and then use these rules to extract information from text
  2. Supervised: Let’s say we have a sentence S. It has two entities E1 and E2. Now, the supervised machine learning model has to detect whether there is any relation (R) between E1 and E2. So, in a supervised approach, the task of relation extraction turns into the task of relation detection. The only drawback of this approach is that it needs a lot of labeled data to train a model
  3. Semi-supervised: When we don’t have enough labeled data, we can use a set of seed examples (triples) to formulate high-precision patterns that can be used to extract more relations from the text

1. spaCy’s Rule-Based Matching:

import spacy
# Load the installed model "en_core_web_sm"
nlp = spacy.load("en_core_web_sm")

Processing text

Processing text with the nlp object returns a Doc object that holds all information about the tokens, their linguistic features and their relationships.

# sample text 
text = "GDP in developing countries such as Vietnam will continue growing at a high rate."
doc = nlp(text) # create a spaCy object

Accessing token attributes

# print token, dependency, POS tag 
for tok in doc:
    print(tok.text, "-->", tok.dep_, "-->", tok.pos_)
Output:
GDP --> nsubj --> NOUN
in --> prep --> ADP
developing --> amod --> VERB
countries --> pobj --> NOUN
such --> amod --> ADJ
as --> prep --> ADP
Vietnam --> pobj --> PROPN
will --> aux --> VERB
continue --> ROOT --> VERB
growing --> xcomp --> VERB
at --> prep --> ADP
a --> det --> DET
high --> amod --> ADJ
rate --> pobj --> NOUN
. --> punct --> PUNCT

Have a look at the terms “such” and “as”. They are preceded by a noun (“countries”) and followed by a proper noun (“Vietnam”) that acts as the hyponym.

So, let’s create the required pattern using the dependency tags and the POS tags:

# define the pattern
pattern = [{'POS': 'NOUN'},
           {'LOWER': 'such'},
           {'LOWER': 'as'},
           {'POS': 'PROPN'}]  # proper noun
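The matching step itself is not shown above, so here is a minimal sketch of how such a pattern is typically run with spaCy’s Matcher (the pattern name "hyponym" is our own choice; the add() call uses the spaCy v3 signature, older versions use matcher.add("hyponym", None, pattern)):

from spacy.matcher import Matcher

# register the pattern under a name and run the matcher over the doc
matcher = Matcher(nlp.vocab)
matcher.add("hyponym", [pattern])

# print the text of every matched span
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)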

Output: ‘countries such as Vietnam’

Nice! It works perfectly. However, if we could get “developing countries” instead of just “countries”, then the output would make more sense.
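One way to capture the modifier, as a sketch, is to add an optional adjectival-modifier token in front of the noun. Note from the token output above that “developing” carries the dependency label amod, so we can match on that (the pattern name "hyponym_amod" is again our own choice):

# extended pattern: optionally match an adjectival modifier ("developing") before the noun
pattern = [{'DEP': 'amod', 'OP': '?'},
           {'POS': 'NOUN'},
           {'LOWER': 'such'},
           {'LOWER': 'as'},
           {'POS': 'PROPN'}]

matcher = Matcher(nlp.vocab)
matcher.add("hyponym_amod", [pattern])
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)

Because the modifier token is optional ('OP': '?'), the Matcher can return both the shorter and the longer span; in practice you would keep the longest match, which here is “developing countries such as Vietnam”.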

In this article, we learned about Information Extraction, the concept of relations and triples, and different methods for relation extraction.