NLP (Natural Language Processing) with TensorFlow Part-1

A series of tutorials for learning and practicing NLP with TensorFlow.

Not all data comes in a standardized form. Data is created when we talk, when we tweet, when we send messages on WhatsApp, and in many other activities. The majority of this data is textual and highly unstructured.

Even though we have this data in abundance, the information stored in it is not directly available unless it is interpreted manually (by reading) or analyzed by an automated system. To gain meaningful insights from text data, it is important to familiarise ourselves with the techniques and principles of Natural Language Processing (NLP).

So, in this series of articles, we will see how to gain insights from text data and get hands-on practice using those insights to train NLP models that perform some human-mimicking tasks. Let’s dive in and look at some of the basics of NLP.

Tokenization:

Tokenization is the process of representing words in a way that a computer can process them, with a view to later training a neural network that can understand their meaning.

Let’s look at how we can tokenize the sentences using TensorFlow libraries.

Make sure to follow the comments in the code to understand what we are doing.

Tokenizing the sentences:
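A minimal sketch of this step, assuming TensorFlow 2.x; the example sentences and the num_words value are assumed choices that reproduce the word index shown in the output below.

# Tokenize a small corpus with the Keras Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat'
]

tokenizer = Tokenizer(num_words=100)  # keep at most the 100 most frequent words
tokenizer.fit_on_texts(sentences)     # build the word index from the corpus
word_index = tokenizer.word_index     # dict mapping each word to its token number
print(word_index)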
Output:
{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

The tokenizer is also smart enough to handle some exceptions. In the next example, we add the word “dog!”, but the tokenizer does not create a new token for “dog!” because punctuation is stripped by default.

The tokenizer being smart:
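A sketch of the same code with a third sentence added; by default the Tokenizer lowercases text and strips punctuation, which is why “dog!” maps to the existing token for “dog”.

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'  # "dog!" reuses the token already assigned to "dog"
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)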
Output:
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

Sequencing:

Now that our words are represented like this, we next need to represent each sentence as a sequence of these numbers in the correct order. Then we will have data ready to be processed by a neural network to understand, or maybe even generate, new text. Let’s look at how we can manage this sequencing using TensorFlow tools.

Sequencing the sentences:
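A minimal sketch of sequencing with texts_to_sequences; the fourth sentence is assumed from the word index shown in the output below.

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Encode each sentence as the list of its word tokens, in order
sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)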
Output:
{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

Now we have basic tokenization done, but there is a catch. This is all very well for getting data ready to train a neural network, but what happens when that neural network has to classify texts containing words it has never seen before? This can confuse the network. Let’s look at how to handle that next.

Let’s try sequencing sentences that contain words the tokenizer has not seen yet.

Testing the tokenizer on unseen words:
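A sketch of running two unseen test sentences through the tokenizer fitted above:

# "really", "loves" and "manatee" were never seen during fitting
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)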
Output:
[[4, 2, 1, 3], [1, 3, 1]]

Unseen words:

From the above results :

‘i really love my dog’ = [4, 2, 1, 3], i.e. a 5-word sentence ends up as a 4-number sequence. Why? Because the word “really” is not in the word index; the corpus used to build the word index doesn’t contain that word.

Similarly, ‘my dog loves my manatee’ = [1, 3, 1], i.e. a 5-word sentence ends up as a 3-number sequence, equivalent to “my dog my”, because “loves” and “manatee” are not in the word index.

So we can imagine that we would need a huge word index to handle sentences that are not in the training set. But there is a little trick we can use so that we at least don’t lose the length of the sequence. Let’s take a look at that.

By using the oov_token (out-of-vocabulary) property and setting it to something we would not expect to see in the corpus, like “<OOV>”, we reserve a token for unknown words. The tokenizer creates a token for it and replaces any word it doesn’t recognize with this out-of-vocabulary token instead. It’s simple but effective. Let’s look at an example.

Out-of-vocabulary tokenization:
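A sketch of the same pipeline, this time creating the Tokenizer with an oov_token:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Reserve a special token for words not seen during fitting
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]
test_seq = tokenizer.texts_to_sequences(test_data)

print(word_index)
print(test_seq)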
Output:
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

Now we can see that the length of each sentence has been retained and the unseen words have been replaced by the “<OOV>” token. So the resulting sentences read like:

‘i really love my dog’ = [5, 1, 3, 2, 4] = ‘i <OOV> love my dog’
‘my dog loves my manatee’ = [2, 4, 1, 2, 1] = ‘my dog <OOV> my <OOV>’

We still lose some meaning, but a lot less, and the sequences at least have the correct lengths. And while this keeps each sequence the same length as its sentence, we might wonder: when it comes to training a neural network, how can it handle sentences of different lengths?

With images, they are all usually the same size. So how would we solve that problem?

Padding the sequences:

A simple solution is padding. For this, we will use pad_sequences, imported from the sequence module of tensorflow.keras.preprocessing. As the name suggests, it pads our sequences: we just pass them to the pad_sequences function and the rest is done for us.

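A minimal sketch of padding the training sequences produced by the <OOV> tokenizer above:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Encode the training sentences with the tokenizer fitted above
sequences = tokenizer.texts_to_sequences(sentences)

# Pad every sequence with zeros (at the front, by default) up to the length of the longest one
padded = pad_sequences(sequences)

print(word_index)
print(sequences)
print(padded)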
Output:
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]

So our first sequence [5, 3, 2, 4] is preceded by 3 zeros in the padded result. But why 3 zeros? Because our longest sentence has 7 words in it; when we pass the corpus of sequences to pad_sequences, it measures that and makes all of the sequences equally sized by padding them with zeros at the front. Note that the padding value 0 is not assigned to any word; it is distinct from the <OOV> token, which is 1.

Now, we might not want the zeros at the front but after the sentence instead. That’s easy: we just set the padding parameter to “post”, i.e. padding='post'.

Or, if we don’t want the padded sentences to be as long as the longest sentence, we can specify the desired length with the maxlen parameter. But wait, what happens if a sentence is longer than maxlen?

Well, then we can specify how to truncate the sentence: either by chopping off words at the end, with post truncation, or from the beginning, with pre truncation. Please refer to the pad_sequences documentation for other options.

The call to pad_sequences might then look like:

padded = pad_sequences(sequences, maxlen=5, padding='post', truncating='post')
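Applied to the sequences from the padding example above, that call would produce something like:

[[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]

The three shorter sequences are padded with a zero at the end (padding='post'), and the 7-token sentence is cut down to its first 5 tokens (truncating='post').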

What’s coming next:
Now that we’ve seen how to tokenize our text and organize it into sequences, in the next tutorial we will take that data and train a neural network with it. We’ll look at a dataset of sentences labelled as sarcastic or not, and use it to determine whether sentences contain sarcasm.

The full code is available as a Jupyter notebook on my GitHub.

Make sure to follow me in order to get updates on this series of tutorials on NLP.