CCPA, PII and NLP

Source: Deep Learning on Medium

Named Entity Recognition (NER)

NER is the task of identifying things like names, organizations, locations, dates/times etc. in text. NER can be used to identify some of the personal information contained within text data.

Example: John moved from Arizona to California in December.
The above sentence contains three types of entities:
Person → John
Location → Arizona and California
Date/Time → December
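The same example can be written as one label per word, which is the form a word-level classifier would predict. The sketch below assumes the common BIO tagging scheme (B- begins an entity, I- continues it, O marks a word outside any entity). The tag names and the helper `group_entities` are illustrative, and the labels are written by hand rather than produced by a model.

```python
from typing import List, Tuple

# One hand-written label per token for the example sentence (not model output).
tokens = ["John", "moved", "from", "Arizona", "to", "California", "in", "December", "."]
tags = ["B-PERSON", "O", "O", "B-LOC", "O", "B-LOC", "O", "B-DATE", "O"]


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    words, etype = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == etype:
            words.append(token)  # continue the running entity
            continue
        if words:
            entities.append((" ".join(words), etype))  # close the running entity
            words, etype = [], None
        if prefix in ("B", "I"):
            # Start a new entity (a stray I- tag is treated like B-).
            words, etype = [token], label
    if words:
        entities.append((" ".join(words), etype))
    return entities


print(group_entities(tokens, tags))
```

Multi-word entities are handled by the I- tags, so "New York" tagged as B-LOC, I-LOC comes back as a single location.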

So one can train a deep learning model to classify each word in a sentence as either one of the named entity types or not an entity. However, several libraries come with pre-trained models for the NER task. Identifying PII is very important because it enables fast retrieval of such information, proper securing of it (by encryption etc.), and control over who can access it. Below are some of the libraries which can be used for NER →

Stanford CoreNLP

Stanford CoreNLP is a popular NLP library written in Java. It comes with pre-trained models for various NLP tasks such as POS (part-of-speech) tagging and NER, for English and several other languages. There is also a newer project, StanfordNLP, a Python library that currently supports POS tagging, lemmatization etc. but does not support NER.

spaCy

spaCy is a very popular NLP library whose core components are written in Cython, which makes it very fast. It also comes with pre-trained models for NER and other tasks, for English and several other languages, in different sizes to choose from depending on your application. The models are trained on the OntoNotes 5 dataset. Here I am using the largest English model, en_core_web_lg. Let's look at how to use spaCy for NER →

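A minimal sketch of running spaCy NER, assuming spaCy is installed and the model has been downloaded with `python -m spacy download en_core_web_lg`. The helper names `load_model` and `extract_entities` are illustrative and this is not the author's original code. `spacy.load`, `doc.ents`, `ent.text`, `ent.label_`, `ent.start_char` and `ent.end_char` are part of spaCy's documented API.

```python
from typing import Callable, List, Tuple


def load_model(name: str = "en_core_web_lg"):
    """Load a pre-trained spaCy pipeline (spaCy is imported lazily here)."""
    import spacy

    return spacy.load(name)


def extract_entities(nlp: Callable, text: str) -> List[Tuple[str, str, int, int]]:
    """Return (text, label, start_char, end_char) for every entity found in text."""
    doc = nlp(text)
    return [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
```

Calling `extract_entities(load_model(), "My name is Jane Doe, I live in Seattle")` on a made-up sentence like this one would be expected to yield a PERSON entity and a GPE entity, though the exact result depends on the model version.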

After recognizing the entities we can mask them, for example with ‘xxxx’, which turns a sentence into something like “My name is xxxx xxxx, I live in xxxx” and protects the user’s personal information. spaCy’s pre-trained models based on OntoNotes 5 can recognize the following types of entities →

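spaCy OntoNotes 5 entities:
PERSON → people, including fictional
NORP → nationalities, religious or political groups
FAC → buildings, airports, highways, bridges etc.
ORG → companies, agencies, institutions etc.
GPE → countries, cities, states
LOC → non-GPE locations, mountain ranges, bodies of water
PRODUCT → objects, vehicles, foods etc. (not services)
EVENT → named hurricanes, battles, wars, sports events etc.
WORK_OF_ART → titles of books, songs etc.
LAW → named documents made into laws
LANGUAGE → any named language
DATE → absolute or relative dates or periods
TIME → times smaller than a day
PERCENT → percentages, including "%"
MONEY → monetary values, including unit
QUANTITY → measurements, such as weight or distance
ORDINAL → "first", "second" etc.
CARDINAL → numerals that do not fall under another type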
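Masking recognized entities, as described above, can be sketched with character offsets. This assumes each entity is available as a non-overlapping (start, end) pair, such as spaCy's `ent.start_char` and `ent.end_char`. The function name `mask_spans` and the sample sentence are hypothetical.

```python
from typing import Iterable, Tuple


def mask_spans(text: str, spans: Iterable[Tuple[int, int]], mask: str = "xxxx") -> str:
    """Replace every word inside each (start, end) character span with the mask."""
    masked = text
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        words = masked[start:end].split()
        replacement = " ".join(mask for _ in words)  # one mask per word
        masked = masked[:start] + replacement + masked[end:]
    return masked


text = "My name is Jane Doe, I live in Seattle"
# Offsets written by hand here; in practice they would come from the NER step.
spans = [(11, 19), (31, 38)]
print(mask_spans(text, spans))  # My name is xxxx xxxx, I live in xxxx
```

Masking one word at a time keeps the shape of the sentence visible while hiding the values themselves. Using a single mask per entity would be an equally reasonable choice.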

For information such as phone numbers and email addresses, which the spaCy models do not recognize, regular expressions can be used. Besides masking, the PII can also be encrypted to further protect it.
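A rough sketch of the regex approach, using Python's built-in `re` module. The patterns are deliberately simple assumptions: the phone pattern only covers North American style numbers such as 555-123-4567 or (555) 123-4567, and real data will usually need broader patterns. The names `EMAIL_RE`, `PHONE_RE` and `mask_contact_info` are illustrative.

```python
import re

# Simplified email pattern; valid addresses can be more varied than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumes numbers shaped like 555-123-4567, 555.123.4567 or (555) 123-4567.
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")


def mask_contact_info(text: str, mask: str = "xxxx") -> str:
    """Mask email addresses and phone numbers matched by the patterns above."""
    text = EMAIL_RE.sub(mask, text)
    return PHONE_RE.sub(mask, text)


print(mask_contact_info("Reach me at jane.doe@example.com or 555-123-4567."))
# Reach me at xxxx or xxxx.
```

The same function can be applied after entity masking, so that names and locations found by NER and contact details found by regex are all hidden in one pass over the text.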