Natural Language Processing with spaCy: Steps and Examples

Original article was published by Dhilip Subramanian on Artificial Intelligence on Medium


Installation

Code

# Installing the spaCy library
!pip install -U spacy
# Downloading the small English model used later in this article
!python -m spacy download en_core_web_sm

Loading Models

spaCy has different types of pre-trained models. These models enable spaCy to perform several NLP-related tasks, such as part-of-speech tagging, tokenization, lemmatization, named entity recognition, and dependency parsing. You can check the different types of models here.

spaCy supports different language models. Here, we use the "en_core_web_sm" English model.

Models are loaded using "spacy.load".

Code

# Importing the spaCy library
import spacy

# Loading the English model and initializing an object called 'nlp'
nlp = spacy.load("en_core_web_sm")

We loaded the model and initialized an object called 'nlp'. Whenever you call nlp on a text, spaCy processes it using the 'en_core_web_sm' model. Here, the 'nlp' object is a language model instance.

Reading a String

Text preprocessing is a crucial step in natural language processing. It transforms raw text into a form that machine learning algorithms can work with more effectively.

Consider the text example below.

Code

text = """ The Republican president is being challenged by Democratic Party nominee Joe Biden, who is best known as Barack Obama’s vice-president but has been in US politics since the 1970s.As election day approaches, pollingcompanies will be trying to gauge the mood of the nation by asking voters which candidate they prefer."""#Passing the text to nlp and initialize an object called 'doc'doc = nlp(text)#Checking the type of doc objecttype(doc)

Output

spacy.tokens.doc.Doc

When the above text is passed to the 'nlp' object, spaCy first tokenizes it to produce a 'doc' object. The 'doc' object is then processed in several steps, such as the tagger, parser, and ner. This is also called the processing pipeline. The type of the doc object is spacy.tokens.doc.Doc. We are going to see the steps involved in the pipeline one by one.

https://spacy.io/usage/processing-pipelines
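
As a quick check, you can list the components of the loaded pipeline with the pipe_names attribute. A minimal sketch; the exact component names can vary by model and spaCy version:

# Importing the spaCy library
import spacy

# Loading the English model
nlp = spacy.load("en_core_web_sm")

# Listing the pipeline components applied to every doc
print(nlp.pipe_names)

For the small English model this typically prints something like ['tagger', 'parser', 'ner'].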

Sentence Detection

Sentence detection is used to identify the start and end of sentences in a given text. This helps divide raw text into meaningful units and also helps in performing part-of-speech tagging and named entity recognition. spaCy uses the sents attribute to identify the sentences.

Code

# Passing the above text example into the nlp object
sentence = nlp(text)

# Identifying the sentences using the sents attribute
sentences = list(sentence.sents)

# Length of the sentences
print("The length of the sentences:", len(sentences))

# Reading the sentences
for sent in sentences:
    print(sent)

Output

The length of the sentences: 2
The Republican president is being challenged by Democratic Party nominee Joe Biden, who is best known as Barack Obama’s vice-president but has been in US politics since the 1970s.
As election day approaches, polling companies will be trying to gauge the mood of the nation by asking voters which candidate they prefer.

spaCy identified the sentences correctly, using the full stop as a delimiter: 2 sentences in total from the above text example.
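
Each token also carries an is_sent_start attribute marking whether it begins a new sentence, which can be handy for boundary checks. A minimal sketch; the sample text is made up, and exact behaviour may vary slightly between spaCy versions:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The first sentence ends here. The second one starts now.")

# is_sent_start is truthy for tokens that open a new sentence
for token in doc:
    if token.is_sent_start:
        print("Sentence starts at:", token.text)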

Tokenization

Tokenization refers to dividing the text into a sequence of words or sentences. It involves breaking a complex sentence into words, understanding the importance of each word with respect to the sentence, and finally producing a structural description of the input sentence.

Tokenize the doc using the token.text attribute.

Code

# Tokenization

# Sample text
text = "The Republican president is being challenged by Democratic Party nominee Joe Biden,"

# Passing the text into nlp and storing it as a doc object
doc = nlp(text)

# Tokenizing the doc using the token.text attribute
for token in doc:
    print(token.text)

Output

From the above output, we can see that the text splits into tokens. The tokens include punctuation as well as common words such as 'is' and 'being', which are called stop words. These punctuation symbols and stop words add little meaning to our text, so we need to remove them.

Stopwords

As mentioned above, stop words are the most common words in a language, like "at", "am", "is", "above", and "for". These words carry little meaning on their own and are usually removed from texts.

spaCy has a list of stop words for each language model. The code below checks the list of stop words in the English language model.

Code

# Checking the stop words for the English language model
stopwords = spacy.lang.en.stop_words.STOP_WORDS

# Checking the length of the stop word list
print("The length of stopwords:", len(stopwords))

# Printing the first five stop words
for i in list(stopwords)[:5]:
    print(i)

Output

How do we identify and remove the stop words from the text?

We can identify and remove stop words using the is_stop token attribute in spaCy.

Identifying the stop words

Code

# Printing the stop words from our text example
for token in doc:
    if token.is_stop:
        print(token)

Output

Removing the stop words

Code

# Printing the total number of tokens in the doc
print("Number of tokens in the doc:", len(doc))

# Removing the stop words from the doc
doc2 = []
for token in doc:
    if not token.is_stop:
        doc2.append(token)

# Printing the total number of tokens after removing stop words
print("Number of tokens after removing stopwords:", len(doc2))

Output
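
The stop word list is also customizable: you can add your own entries by extending the model defaults and flagging the word in the vocabulary. A minimal sketch, where the word 'btw' is just an illustrative example:

import spacy

nlp = spacy.load("en_core_web_sm")

# Adding a custom stop word ('btw' is only an illustrative choice)
nlp.Defaults.stop_words.add("btw")
nlp.vocab["btw"].is_stop = True

doc = nlp("btw the polls open early")
print([token.text for token in doc if token.is_stop])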

Punctuations

We can remove the punctuation from the text using the is_punct token attribute.

Code

# Removing punctuation

# Sample text
text = "The Republican president, is being challenged by Democratic Party nominee: Joe Biden,"
doc = nlp(text)

# Printing only the tokens that are not punctuation
for punc in doc:
    if not punc.is_punct:
        print(punc)

Output

Punctuation was removed from the text.

Lemmatization

Lemmatization is the process of converting a word to its base form. For example, lemmatization correctly reduces 'caring' to its base form 'care'.

Code

# Sample text
text = "The Republican president is being challenged by Democratic Party nominee Joe Biden"

# Passing the text into the spaCy model and storing it as a doc object
doc = nlp(text)

# Lemmatization: printing each token and its lemma side by side
for token in doc:
    print(token, '-->', token.lemma_)

Output

The word 'is' is converted into 'be', 'being' into 'be', and 'challenged' into 'challenge'.

spaCy provides various token attributes. Some common attributes used in text preprocessing are listed below, and a short demonstration follows the list; which attributes to use depends on the dataset. For more attributes, please check here.

  • is_ascii checks whether the token consists of ASCII characters.
  • is_digit checks whether the token consists of digits.
  • is_lower checks whether the token is in lowercase.
  • is_upper checks whether the token is in uppercase.
  • text_with_ws prints the token text with its trailing space (if present).
  • is_alpha checks whether the token consists of alphabetic characters.
  • is_punct checks whether the token is a punctuation symbol.
  • is_space checks whether the token is a whitespace character.
  • shape_ prints the shape of the word (e.g. 'Xxxxx' for 'Biden').
  • is_stop checks whether the token is a stop word.
  • like_email checks whether the token resembles an email address.
  • like_url checks whether the token resembles a URL.
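
As a quick demonstration of a few of these attributes, here is a minimal sketch; the sample sentence is made up for illustration:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Email me at test@example.com or visit https://spacy.io by 2020.")

# Printing a few token attributes side by side
for token in doc:
    print(token.text, token.is_alpha, token.is_digit,
          token.like_email, token.like_url, token.shape_)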

Parts of Speech Tagging

Part-of-speech tagging is used to assign a part of speech to each word of a given text (such as noun, verb, pronoun, adverb, conjunction, adjective, or interjection) based on its definition and its context.

Part-of-speech tagging can be done in spaCy using token attributes: tag_ shows the fine-grained part of speech and pos_ shows the coarse-grained part of speech. spacy.explain shows descriptive details about a particular tag. Please check here for more details.

Code

# Sample text
text = "The Republican president is being challenged by Democratic Party nominee Joe Biden"

# Passing the text into the spaCy model (nlp) and storing it as a doc object
doc = nlp(text)

# Part-of-speech tagging
for token in doc:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

Output

The above output shows the parts of speech for all the words, with descriptive details from spacy.explain. spaCy also provides a built-in visualizer called displaCy, which helps us visualize the POS tags.

Code

# Visualizing the POS tags using displaCy

# Importing displacy
from spacy import displacy

# Passing the doc to the renderer
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

Output
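
Outside a Jupyter notebook, displacy.render can return the raw SVG markup instead of drawing inline, which you can then save to a file. A minimal sketch, assuming a plain Python script; the filename 'pos_tags.svg' is just an example:

from pathlib import Path
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Republican president is being challenged by Democratic Party nominee Joe Biden")

# With jupyter=False, render() returns the SVG markup as a string
svg = displacy.render(doc, style='dep', jupyter=False, options={'distance': 90})

# Saving the markup to a file ('pos_tags.svg' is an illustrative name)
Path('pos_tags.svg').write_text(svg, encoding='utf-8')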