NLP (Natural Language Processing) with TensorFlow

A tutorial for learning and practicing NLP with TensorFlow.

Not all data comes in a standardized form. Data is created when we talk, when we tweet, when we send messages on WhatsApp, and in many other activities. The majority of this data is textual and highly unstructured.

Although there is a huge amount of it, the information stored in this data is not directly available unless it is interpreted manually (by reading it) or analyzed by an automated system. To gain meaningful insights from text data, it is important to familiarise ourselves with the techniques and principles of Natural Language Processing (NLP).

So in this article, we will see how to gain insights from text data and get hands-on practice using those insights to train NLP models that perform some human-mimicking tasks. Let's dive in and look at some of the basics of NLP.

Tokenization:

Tokenization is the process of representing words in a way that a computer can process them, with a view to later training a neural network that can understand their meaning.

Let’s look at how we can tokenize the sentences using TensorFlow tools.

Make sure to follow the comments in the code to understand what we are doing.

tokenizing the sentences
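A minimal sketch of this step (the example sentences here are an assumption; any small corpus works the same way):

from tensorflow.keras.preprocessing.text import Tokenizer

# A small example corpus, assumed here for illustration
sentences = [
    'I love my dog',
    'I love my cat'
]

# num_words keeps only the 100 most frequent words when sequencing
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)  # build the word index from the corpus

word_index = tokenizer.word_index  # dict mapping each word to its token
print(word_index)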
Output:
{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

The tokenizer is also smart enough to catch some exceptions. In the next example, we add the word "dog!", but the tokenizer is smart enough not to create a new token for "dog!": punctuation is stripped, so it reuses the existing "dog" token.

tokenizer being smart
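A sketch of the same idea, this time with a sentence containing "dog!" added to the corpus (again, the exact sentences are an assumption):

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)

# "dog!" is lowercased and stripped of punctuation, so it maps to the existing "dog" token
print(tokenizer.word_index)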
Output:
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

Sequencing:

Now that our words are represented like this, we next need to represent each sentence as a sequence of numbers in the correct order. Then the data will be ready to be processed by a neural network that understands, or maybe even generates, text. Let's look at how we can manage this sequencing using TensorFlow tools.

sequencing the sentences
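A sketch of the sequencing step, using texts_to_sequences on a slightly larger assumed corpus:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# texts_to_sequences replaces each word with its token, sentence by sentence
sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)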
Output:
{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

Now we have basic tokenization done, but there is a catch. This is all very well for getting data ready to train a neural network, but what happens when that network has to classify texts containing words it has never seen before? Unseen words can confuse the network. Let's look at how to handle that next.

Let's try sequencing sentences that contain words the tokenizer has not seen yet.

testing the tokenizer for unseen words
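Continuing with the tokenizer fitted above (no OOV token set yet), we can sequence two sentences that contain unseen words:

# "really", "loves" and "manatee" were never seen during fit_on_texts
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)  # unseen words are simply dropped from the sequences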
Output:
[[4, 2, 1, 3], [1, 3, 1]]

Unseen words:

From the above results:

'i really love my dog' = [4, 2, 1, 3], i.e. a 5-word sentence ends up as a 4-number sequence. Why?
Because the word "really" was not in the word index; the corpus used to build the word index doesn't contain that word.

Similarly, 'my dog loves my manatee' = [1, 3, 1], i.e. a 5-word sentence ends up as a 3-number sequence, equivalent to "my dog my", because "loves" and "manatee" are not in the word index.

So we can imagine that we would need a huge word index to handle sentences that are not in the training set. But to avoid losing the length of the sequence, there is also a little trick we can use. Let's take a look at it.

We can use the OOV (out-of-vocabulary) token property and set it to something we would not expect to see in the corpus, like "<OOV>", a string we can safely assume never appears in real text. The tokenizer will then create a token for it and replace every word it doesn't recognize with this out-of-vocabulary token instead. It's simple but effective. Let's look at an example.

out of vocabulary tokenization
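The only change is passing oov_token when creating the Tokenizer; the rest continues from the snippets above:

# Reserve token 1 for every word the tokenizer does not recognize
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

test_seq = tokenizer.texts_to_sequences(test_data)

print(word_index)
print(test_seq)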
Output:
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

Now we can see that the length of each sentence has been retained and the unseen words have been replaced by the "<OOV>" token. So the resulting sentences look like:

'i really love my dog' = [5, 1, 3, 2, 4] = 'i <OOV> love my dog'
'my dog loves my manatee' = [2, 4, 1, 2, 1] = 'my dog <OOV> my <OOV>'

We still lose some meaning, but a lot less, and the sequences are at least the correct length. And while this keeps each sequence the same length as its sentence, we might wonder: when it comes to training a neural network, how can it handle sentences of different lengths?

With images, the inputs are usually all the same size. So how do we solve this problem for text?

Padding the sequences:

A simple solution is padding. For this, we will use pad_sequences, imported from the sequence module of tensorflow.keras.preprocessing. As the name suggests, we can use it to pad our sequences. We just need to pass our sequences to the pad_sequences function and the rest is done for us.

padding the sequences
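A sketch of the padding step, continuing with the tokenizer (including the OOV token) and sentences from above:

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = tokenizer.texts_to_sequences(sentences)

# By default, pad_sequences pads with zeros at the front, up to the longest sequence
padded = pad_sequences(sequences)

print(word_index)
print(sequences)
print(padded)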
Output:
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]

So our first sequence, [5, 3, 2, 4], is preceded by 3 zeros in the padded result. But why 3 zeros? Because our longest sentence has 7 words in it. When we pass the corpus of sequences to pad_sequences, it measures the longest one and makes all the sequences the same size by padding the shorter ones with zeros at the front. Note that 0 is reserved for padding; it is not the OOV token, which is 1.

Now we might not want the zeros in front, but after the sentence instead. Well, that's easy: we can just set the padding parameter to "post", i.e. padding="post".

Or, if we don't want the padded sentences to be as long as the longest sentence, we can specify the desired length with the "maxlen" parameter. But wait, what happens if a sentence is longer than "maxlen"?

Well, then we can specify how to truncate the sentence: by chopping off words at the end, with post truncation, or from the beginning, with pre truncation. Please refer to the pad_sequences documentation for other options.

The call to pad_sequences might then look like:

padded = pad_sequences(sequences, maxlen=5, padding='post', truncating='post')

So far we have seen how to tokenize text into numeric values and how to use TensorFlow tools to regularize and pad that text. Now that pre-processing is out of the way, we can look at how to build a classifier to recognize sentiment in text.

We'll start with a dataset of news headlines, where each headline has been categorized as sarcastic or not. We'll train a classifier on this data so that it can tell us whether a new piece of text looks sarcastic or not.

This dataset has 3 fields:

is_sarcastic: 1 if the headline is sarcastic, 0 otherwise.
headline: the headline of the news article.
article_link: link to the original news article.

The data is stored in JSON format, and we will convert it into Python lists for training. Python's built-in json module can do this.

reading JSON files using python
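A sketch of the loading step; the file name sarcasm.json is an assumption based on the commonly distributed copy of this dataset:

import json

with open('sarcasm.json', 'r') as f:
    datastore = json.load(f)  # a list of dicts, one per headline

sentences = []
labels = []
urls = []

for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])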

Splitting data into train and test sets:

Now that we have 3 lists, one with the labels, one with the headline text, and one with the URLs, we can start our familiar steps of tokenizing and sequencing the words. By calling tokenizer.fit_on_texts with the headlines, we create tokens for every word in the corpus, and we'll see them in the word index.

Now there's a problem here. We don't have a split in the data for training and testing; we just have a single list of 26,709 headlines. Fortunately, Python's list slicing makes it super easy to split this up.

splitting the data into train and test sets
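A sketch of the split; the training size of 20,000 is an assumption, and any sensible split point works the same way:

training_size = 20000  # assumed split point

training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]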

There is still a problem here. Remember that we tokenized every word in the dataset to create the word index? But if we really want to test the model's effectiveness, we have to ensure that the neural network only sees the training data and never sees the test data.

So we have to make sure the tokenizer is fit on the training data only. Let's do that now.

tokenizing training and test sets
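A sketch of fitting the tokenizer on the training sentences only; vocab_size, max_length and the padding settings are assumptions:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 10000   # assumed vocabulary cap
max_length = 100     # assumed maximum sequence length
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)  # fit on the training data only
word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding='post', truncating='post')

# The test set is sequenced with the same tokenizer but was never used to fit it
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding='post', truncating='post')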

Word Embeddings:

You might be wondering: we've turned our sentences into numbers, with each number being a token representing a word, but how do we get meaning from that? How do we determine whether something is sarcastic just from the numbers?

Well, here's where the concept of embeddings comes in.

Let's consider the most basic of sentiments: good and bad. We often see these as opposites, so we can plot them as pointing in opposite directions, as shown in the image below.

So what happens with a word like "meh"? It's not particularly good and not particularly bad, probably a little more bad than good, so we can plot it somewhere near the bad direction. Or take the phrase "not bad", which usually conveys a little bit of goodness, but not necessarily very good, so it can be inclined towards the good direction.

image by author

Now imagine plotting this on X and Y axes; we can then describe the good or bad sentiment of a word by its X and Y coordinates, as shown in the image (not to scale). Similarly, we can represent words like "meh" and "not bad" as points in the XY plane.

So by looking at the direction of a word's vector, we can start to determine its meaning. What if we extend this to many dimensions instead of just two? And what if the words from sentences labeled with sentiments, like sarcastic and not sarcastic, are plotted in that multi-dimensional space, and as we train, we learn what the direction of each word's vector should look like? Words that appear mostly in sarcastic sentences will end up with a strong component in the sarcastic direction, and vice versa.

As we load more and more sentences into the network for training, these directions can change. And when we have a fully trained network and give it a set of words, it can look up the vectors for those words, sum them up, and thus give us an idea of the sentiment. This concept is known as an embedding.

Now let’s take a look at how we can do this using the TensorFlow embedding layer.

building and training the model
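A sketch of a model along these lines; the layer sizes and embedding dimension are assumptions, and the key pieces are the Embedding layer followed by pooling and a sigmoid output:

import numpy as np
import tensorflow as tf

embedding_dim = 16  # assumed size of the embedding space

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),  # average the word vectors of each headline
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')  # 1 = sarcastic, 0 = not sarcastic
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Keras expects arrays, not plain Python lists
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)

history = model.fit(training_padded, training_labels,
                    epochs=10,
                    validation_data=(testing_padded, testing_labels),
                    verbose=2)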
results after training for 10 epochs

We can see that we achieved good accuracy on the training data, but the val_accuracy keeps decreasing, which is classic overfitting. So we can either lower the learning rate or train for fewer epochs.

Training for 5 epochs, we achieved 92% accuracy on the training data and, more importantly, around 85% accuracy on the test data, i.e. on headlines the network has never seen, which is pretty good.

Establishing Sentiment:

Now let us see how we can use this model to establish sentiment for unseen sentences.

testing for new sentences
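A sketch of running the trained model on new text; the two example sentences are assumptions:

sentence = [
    "granny starting to fear spiders in the garden might be real",
    "game of thrones season finale showing this sunday night"
]

# New text must go through the same tokenizer and padding as the training data
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')

print(model.predict(padded))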

Output:
[[0.72842506]
 [0.05671594]]

0.728 indicates that the first sentence has about a 72% chance of being sarcastic, so it is classified as sarcastic, whereas 0.056 indicates that the second sentence is very likely non-sarcastic.

Hurray! We have now built our first text-classification model to understand sentiment in text. Give it a try and build more complicated sentiment analysis models.

The full code as a Jupyter notebook is available on my GitHub.

Make sure to follow me in order to get updates on my articles.

Happy Learning!