Enterprise-grade NER with spaCy

Original article was published by Shubham Saboo on Artificial Intelligence on Medium


Build industrial-strength Named Entity Recognition (NER) applications within minutes…

spaCy = space/platform agnostic + faster compute

Named Entity Recognition is one of the most important and widely used NLP tasks. It is the method of extracting entities (key information) from a stack of unstructured or semi-structured data. An entity can be any word or series of words that consistently refers to the same thing. Every detected entity is classified into a predetermined category. For example, an NER model might detect the word “India” in a text and classify it as a “Country”.

Many popular technologies that we use in day-to-day life, such as the smart assistants Siri and Alexa, are backed by Named Entity Recognition. Other real-world applications of NER include ticket triage for customer support, resume screening, and powering recommendation engines. Here is an example of NER in action:

Whether you are new to NLP or have some prior knowledge, spaCy has something for everyone. It caters to every audience, from beginner to advanced. Now let's understand the what, why, and how of spaCy.

What is spaCy?

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) with native support for Python. It is becoming the de facto choice for data scientists and organizations to use a pre-trained spaCy model for production-level NER tasks rather than training a new model from scratch in-house.

If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other? spaCy is there to answer all of these questions.

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

spaCy is fast, accurate and user-friendly with a mild learning curve

Speed comparison of spaCy with its competitors…

Why spaCy?

spaCy comes with its own built-in features and capabilities. It has a collection of pre-trained models for many languages, each of which can be installed as a Python package. These packages become components of your application, just like any other module. They are versioned and can be declared as dependencies in your requirements.txt file.

The following features set spaCy apart from its competitors:

  • Preprocessing: It comes with a pre-defined tokenizer, lemmatizer, and dependency parser that automatically preprocess the input data.
  • Linguistic features: It also has a state-of-the-art part-of-speech tagger that automatically assigns a POS tag to each word.
  • Visualization: It can visualize dependency trees and create beautiful illustrations for the NER task.
    Dependency tree visualization
    NER task visualization
  • Flexibility: Any pipeline component can be augmented or replaced, and new components such as TextCategorizer can be added.
  • Transfer Learning: Any pre-trained model can be picked up and fine-tuned on downstream tasks.
  • Pipeline: spaCy has a built-in processing pipeline that automates the processing of raw text and produces a spaCy Doc object, which can be used for a variety of NLP tasks.
spaCy processing pipeline
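
The pipeline and flexibility points above can be illustrated with a minimal sketch. It assumes spaCy v3, where components are registered with `@Language.component` and added by name, and it uses a blank English pipeline so that no model download is needed. The `token_counter` component is a hypothetical name made up for this illustration, not something spaCy ships.

```python
import spacy
from spacy.language import Language


# Hypothetical custom component: stores the token count on the Doc.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


# A blank pipeline contains only the tokenizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")               # built-in rule-based sentence splitter
nlp.add_pipe("token_counter", last=True)  # the custom component runs last

# Calling nlp() runs the tokenizer and then every component in order.
doc = nlp("spaCy builds a Doc object. Each component adds annotations to it.")
print(nlp.pipe_names)
print(doc.user_data["n_tokens"], "tokens in", len(list(doc.sents)), "sentences")
```

The same `add_pipe` call is how a built-in component would be added to, or reordered within, a pre-trained pipeline.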

spaCy in Action

spaCy is available as a standard Python library on PyPI and can be installed with either pip or conda, depending on your Python environment. The commands for installing spaCy are as follows:

spaCy installation via pip
spaCy installation via conda
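
As a rough guide, the usual install commands are shown in the comments below, followed by a small Python check that the package can be imported afterwards. The helper `is_installed` is a hypothetical name used only for this sketch.

```python
# Shell commands (run these in a terminal, not in Python):
#   pip install -U spacy
#   conda install -c conda-forge spacy
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package) is not None


if is_installed("spacy"):
    import spacy
    print("spaCy version:", spacy.__version__)
else:
    print("spaCy not found; install it with one of the commands above.")
```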

Now let’s explore how to perform named entity recognition efficiently with spaCy. For that we need to download one of the pre-trained language models that spaCy provides. As we saw earlier, spaCy supports multiple languages, but we will restrict ourselves to English. spaCy currently offers three variants of its English language model: small, medium, and large.

All of them start with the prefix en_core_web_ (en_core_web_sm, en_core_web_md, and en_core_web_lg) and ship with pre-defined tokenizer, tagger, parser, and entity recognizer components. As a general trend, accuracy increases with model size. Here we will load the large variant of the English language model.

Loading the model gives us an nlp object with a tokenizer, tagger, parser, and entity recognizer in its pipeline. The next step is to load the textual data and process it with the different components of the nlp object.
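
A minimal sketch of these two steps might look as follows. It assumes the large model has already been downloaded with `python -m spacy download en_core_web_lg`. The fallback to a blank pipeline, the `load_pipeline` helper, and the sample sentence are illustrative choices, not part of the original walkthrough.

```python
import spacy

MODEL_NAME = "en_core_web_lg"  # download first: python -m spacy download en_core_web_lg


def load_pipeline(name=MODEL_NAME):
    """Load a pre-trained pipeline, or a blank English one if it is missing."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        print(f"Model '{name}' is not installed; using a blank English pipeline.")
        return spacy.blank("en")


nlp = load_pipeline()
print("Pipeline components:", nlp.pipe_names)

# Calling the nlp object on raw text returns a processed Doc.
text = "Rafael Nadal visited the Google office in India."
doc = nlp(text)
print([token.text for token in doc])
```

With the blank fallback only tokenization is available, so the tagger, parser, and entity recognizer annotations discussed in the article require the real model.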

For downstream or domain-specific tasks, spaCy also lets us add custom stop words alongside the default ones. Stop words are easy to identify in spaCy: each token has an is_stop attribute that tells us whether the word is a stop word.

Adding custom stopwords
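
One common way to add custom stop words is sketched below. It uses a blank English pipeline, since stop words come from the language data rather than from a trained model. The helper `add_stop_words` and the example word "btw" are assumptions made for illustration.

```python
import spacy


def add_stop_words(nlp, words):
    """Register extra stop words on an existing pipeline."""
    for word in words:
        # Extend the language's default stop word set.
        nlp.Defaults.stop_words.add(word)
        # Also flag the vocabulary entry, in case it was created earlier.
        nlp.vocab[word].is_stop = True


nlp = spacy.blank("en")
add_stop_words(nlp, ["btw"])  # "btw" is just an illustrative choice

doc = nlp("btw the model performs well")
for token in doc:
    print(token.text, token.is_stop)

# Typical use: keep only the tokens that are not stop words.
content_words = [token.text for token in doc if not token.is_stop]
print(content_words)
```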

POS Tagging

Part-of-speech (POS) tagging is the process of labeling each word with its part of speech, such as noun, adjective, verb, or adverb. The tag follows from the language’s grammatical rules and from the context in which the word occurs, that is, its relationships with the other words in the sentence.

After tokenization, spaCy can tag a given Doc object using its state-of-the-art statistical models. The tags are available as attributes of each Token object: pos_ holds the coarse-grained part of speech and tag_ the fine-grained tag. This makes it straightforward to list the tokens of a text alongside their POS tags.
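
A short sketch of reading the tags off each token is given below. It assumes the `en_core_web_lg` model is installed. The helper name `pos_table` and the sample sentence are made up for illustration.

```python
import spacy


def pos_table(doc):
    """Return (text, coarse POS, fine-grained tag) for every token."""
    return [(token.text, token.pos_, token.tag_) for token in doc]


try:
    nlp = spacy.load("en_core_web_lg")  # assumes the model has been downloaded
except OSError:
    nlp = None
    print("en_core_web_lg is not installed; run: python -m spacy download en_core_web_lg")

if nlp is not None:
    doc = nlp("Rafael Nadal visited the Google office in India.")
    for text, pos, tag in pos_table(doc):
        print(f"{text:<10}{pos:<8}{tag}")
```

`spacy.explain` can be called on any tag string, for example `spacy.explain("PROPN")`, to get a short human-readable description.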


Visualizing Parts-of-Speech

spaCy comes with a built-in visualizer called displacy, which can be used to visualize both the syntactic dependencies (relationships) between tokens and the entities contained in a text.
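
The dependency view can be produced roughly as follows. This is a sketch that assumes the `en_core_web_lg` model is installed; the helper `dependency_svg` is a hypothetical wrapper around `displacy.render`.

```python
import spacy
from spacy import displacy


def dependency_svg(doc):
    """Render the dependency parse of a Doc as SVG markup."""
    # jupyter=False makes render() return the markup instead of displaying it.
    return displacy.render(doc, style="dep", jupyter=False)


try:
    nlp = spacy.load("en_core_web_lg")  # assumes the model has been downloaded
except OSError:
    nlp = None
    print("en_core_web_lg is not installed; run: python -m spacy download en_core_web_lg")

if nlp is not None:
    doc = nlp("Rafael Nadal visited the Google office in India.")
    svg = dependency_svg(doc)
    print(svg[:80])  # start of the SVG markup; save it to a .svg file to view it
```

Inside a Jupyter notebook, calling `displacy.render(doc, style="dep")` displays the tree inline, and `displacy.serve(doc, style="dep")` starts a small local web server instead.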


Named Entity Recognition

A named entity is a real-world object with a proper name, for example India, Rafael Nadal, or Google. Here India is a country and is identified as GPE (geopolitical entity), Rafael Nadal is PERSON (person), and Google is ORG (organization). spaCy offers a predefined set of entity types. NER tagging is usually not the end result; its output serves as input for further tasks.
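
Extracting the entities from a processed text can be sketched as below, assuming the `en_core_web_lg` model is installed. The helper `extract_entities` and the sample sentence are illustrative. The labels the model actually predicts may differ from text to text.

```python
import spacy


def extract_entities(doc):
    """Return (entity text, entity label) pairs found in a Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


try:
    nlp = spacy.load("en_core_web_lg")  # assumes the model has been downloaded
except OSError:
    nlp = None
    print("en_core_web_lg is not installed; run: python -m spacy download en_core_web_lg")

if nlp is not None:
    doc = nlp("Rafael Nadal visited the Google office in India.")
    for text, label in extract_entities(doc):
        # spacy.explain gives a short description of a label such as GPE.
        print(f"{text:<15}{label:<8}{spacy.explain(label)}")
```

Returning plain tuples keeps the result easy to pass on to downstream steps such as filtering by label or counting mentions.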


spaCy also comes with a spiffy way of visualizing the NER tagging task using displacy, which provides an intuitive way to view the named entities.
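
A minimal sketch of the entity view follows, again assuming `en_core_web_lg` is installed. The helper `entities_html` is a hypothetical wrapper, and the output is returned as an HTML string rather than shown in a notebook.

```python
import spacy
from spacy import displacy


def entities_html(doc):
    """Render the named entities of a Doc as highlighted HTML markup."""
    # style="ent" highlights entity spans; jupyter=False returns the markup.
    return displacy.render(doc, style="ent", jupyter=False)


try:
    nlp = spacy.load("en_core_web_lg")  # assumes the model has been downloaded
except OSError:
    nlp = None
    print("en_core_web_lg is not installed; run: python -m spacy download en_core_web_lg")

if nlp is not None:
    doc = nlp("Rafael Nadal visited the Google office in India.")
    html = entities_html(doc)
    print(html[:80])  # start of the HTML markup; save it to an .html file to view it
```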