Building a Spam Classifier (NLP) in Python From Scratch

Original article was published by Uzomeziem Eze on Artificial Intelligence on Medium


1. Import Dependencies

We are going to use NLTK for processing the messages, WordCloud and matplotlib for visualization, pandas for loading the data, and NumPy for generating the random probabilities for the train-test split.
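A minimal import block matching the libraries above might look like this (a sketch; the two NLTK downloads are only needed once, for the tokenizer models and the stop word list):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stop word list
```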

2. Loading Data

If we look closely, we can see that we do not need the columns 'Unnamed: 2', 'Unnamed: 3' and 'Unnamed: 4', so we are going to remove them. Next, we rename the column 'v1' as 'label' and 'v2' as 'message'. In the 'label' column, 'ham' (authentic messages) is replaced by 0 and 'spam' (fake messages) by 1.
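A loading-and-cleanup sketch along these lines (the file name 'spam.csv' and the latin-1 encoding are assumptions; the SMS Spam Collection CSV is commonly distributed in that form):

```python
# Load the raw CSV of labelled SMS messages.
mails = pd.read_csv('spam.csv', encoding='latin-1')

# Drop the unused columns and give the remaining ones meaningful names.
mails = mails.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
mails = mails.rename(columns={'v1': 'label', 'v2': 'message'})

# Replace the text labels with numbers: ham -> 0, spam -> 1.
mails['label'] = mails['label'].map({'ham': 0, 'spam': 1})
print(mails.head())
```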

What we get is a dataframe with just two columns: a numeric 'label' (0 for ham, 1 for spam) and the raw 'message' text.

3. Train-Test Split

As in every ML workflow, we need a way to test our model, so we split the data into a training dataset and a test dataset. We train the model on the training dataset and then evaluate it on the test dataset. For this project, we shall use 75% of the dataset for training and the rest for testing; the 75% is selected uniformly at random.
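A sketch of this uniformly random 75/25 split, using NumPy as mentioned in the imports (the helper name is my own):

```python
def split_train_test(data, train_fraction=0.75, seed=0):
    rng = np.random.default_rng(seed)
    # One uniform random number per row; rows below the threshold go to training.
    mask = rng.random(len(data)) < train_fraction
    return data[mask].reset_index(drop=True), data[~mask].reset_index(drop=True)

train_data, test_data = split_train_test(mails)
print(len(train_data), 'training messages,', len(test_data), 'test messages')
```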

4. Visualizing data

Now, let us see which words appear most often in the spam messages! We are going to use the WordCloud library for this purpose.
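A word cloud sketch for the spam messages, using the train_data split from above:

```python
# Join all spam messages (label == 1) into one long string and draw the cloud.
spam_text = ' '.join(train_data[train_data['label'] == 1]['message'])
spam_wc = WordCloud(width=512, height=512).generate(spam_text)

plt.figure(figsize=(8, 8))
plt.imshow(spam_wc)
plt.axis('off')
plt.show()
```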

Running this produces a word cloud of the most frequent words in the spam messages.

As expected, the spam messages mostly contain words like 'FREE', 'call', 'text', 'ringtone' and 'prize claim', common spam words I'm sure we can all relate to.

Similarly, we can generate the WordCloud of the ham messages by filtering on label 0 instead.

5. Training the model

Here is where the meat of the model is. I am going to implement two techniques: Bag of Words and TF-IDF. Let us first start off with Bag of Words.

Quick Preprocessing: Before starting with training we must preprocess the messages. First of all, we shall make all the characters lowercase so that case does not matter. For example, 'free' and 'FREE' mean the exact same thing, so we do not want to treat them as two different words.

Then we will use the NLTK library to tokenize each message in the dataset. Tokenization is the process of splitting up a message into pieces and getting rid of the punctuation characters. For example:

https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html

Next, words like 'go', 'goes' and 'going' all indicate the same activity, so we can replace them with the single word 'go'. This is called stemming. For this process, we are going to use the Porter Stemmer (also in the NLTK library).

To be thorough, we are also going to remove the stop words. Stop words are words which occur extremely frequently in any text, for example 'the', 'a', 'an', 'is', 'to' etc. These words do not give us any information about the content of the text, so it should not matter if we remove them from the text.
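Putting these steps together, a preprocessing sketch using NLTK's word_tokenize, PorterStemmer and English stop word list might look like this:

```python
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(message):
    message = message.lower()                            # lowercase
    tokens = word_tokenize(message)                      # tokenize
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation tokens
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    return [stemmer.stem(t) for t in tokens]             # stem

print(preprocess("You have WON a FREE prize! Call now to claim."))
# -> something like ['free', 'prize', 'call', 'claim']
```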

Optional: You can also use n-grams to improve accuracy. So far I have only dealt with single words, but when two words appear together the meaning can change completely. For example, 'good' and 'not good' are opposite in meaning. If a text contains 'not good', it is better to consider 'not good' as one token rather than 'not' and 'good'. Makes sense?

Therefore, accuracy sometimes improves when we split the text into tokens of two (or more) words rather than single words.
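For illustration, a tiny bigram helper (my own naming) that turns a token list into two-word tokens:

```python
def bigrams(tokens):
    # Pair each token with its right-hand neighbour: ['not', 'good'] -> ['not good']
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

print(bigrams(['not', 'good', 'at', 'all']))
# -> ['not good', 'good at', 'at all']
```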

Bag of Words: In the Bag of Words model we find the 'term frequency', meaning we find the number of occurrences of each word in the dataset. Thus for a word w, using the ham (authentic) messages:

P(w|ham) = TF(w, ham) / (total number of words in the ham messages)

and for spam:

P(w|spam) = TF(w, spam) / (total number of words in the spam messages)
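A training sketch for these term-frequency counts, building on the preprocess function above (helper names are my own):

```python
from collections import defaultdict

def train_counts(data):
    tf_spam, tf_ham = defaultdict(int), defaultdict(int)
    spam_total, ham_total = 0, 0          # total word counts per class
    for _, row in data.iterrows():
        for w in preprocess(row['message']):
            if row['label'] == 1:
                tf_spam[w] += 1
                spam_total += 1
            else:
                tf_ham[w] += 1
                ham_total += 1
    return tf_spam, tf_ham, spam_total, ham_total

tf_spam, tf_ham, spam_total, ham_total = train_counts(train_data)
p_spam = (train_data['label'] == 1).mean()   # class prior P(spam)
```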

Now, TF-IDF was something new I learned while working on this project, and as I will explain, it turned out to be the more useful model.

TF-IDF: TF-IDF stands for Term Frequency-Inverse Document Frequency. In addition to Term Frequency, we compute Inverse document frequency.

Sounds like a lot. Let me explain…

Suppose there are two messages in the dataset: 'hello world' and 'hello foo bar'. TF('hello') is 2. IDF('hello') is log(2/2) = 0, i.e. the log of the total number of messages divided by the number of messages containing the word (the inverse of the document frequency, remember?). If a word occurs in many messages, its IDF is low, which means the word gives us less information.

In this model, each word has a score, which is TF(w)*IDF(w). The probability of each word is then computed by normalizing this score, for example for spam:

P(w|spam) = TF(w, spam) * IDF(w) / (Σ over all words x of TF(x, spam) * IDF(x))

and analogously for ham.
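A TF-IDF scoring sketch on top of the counts from the Bag of Words step (doc_freq maps each word to the number of messages containing it; this framing of the computation is my own):

```python
import math

def document_frequencies(data):
    # Count, for each word, the number of messages it appears in.
    doc_freq = defaultdict(int)
    for _, row in data.iterrows():
        for w in set(preprocess(row['message'])):
            doc_freq[w] += 1
    return doc_freq

doc_freq = document_frequencies(train_data)
total_docs = len(train_data)

def tf_idf(tf_counts):
    # Score each word in one class (spam or ham) as TF(w) * IDF(w).
    return {w: tf * math.log(total_docs / doc_freq[w]) for w, tf in tf_counts.items()}

tfidf_spam, tfidf_ham = tf_idf(tf_spam), tf_idf(tf_ham)
```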

Additive Smoothing: One of the main issues with both models is this: what if we encounter a word in the test dataset which is not part of the training dataset?

In that case, P(w) will be 0, which makes P(spam|w) undefined (since we would have to divide by P(w), which is 0; remember the formula? 💭). To tackle this issue we introduce additive smoothing. In additive smoothing, we add a number alpha to the numerator, and in the denominator we add alpha times the number of classes over which the probability is distributed (here, the number of distinct words in the training data).

Using Bag of Words:

P(w|spam) = (TF(w, spam) + alpha) / (total number of words in the spam messages + alpha * number of distinct words in the training data)

When using TF-IDF:

P(w|spam) = (TF(w, spam) * IDF(w) + alpha) / (Σ over all words x of TF(x, spam) * IDF(x) + alpha * number of distinct words in the training data)

and analogously for ham.

This is done so that the smallest probability of any word is now a finite, non-zero number. The addition in the denominator makes the probabilities of all the words in the spam messages sum to 1.

When alpha = 1, it is called Laplace smoothing, but that's a topic for another Medium story.
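A smoothed probability sketch for the Bag of Words variant (alpha and the helper names are my own; vocab is the set of all words seen during training):

```python
alpha = 1.0  # alpha = 1 corresponds to Laplace smoothing
vocab = set(tf_spam) | set(tf_ham)

def p_word_given_spam(w):
    # Unseen words get TF = 0 but still receive a small non-zero probability.
    return (tf_spam.get(w, 0) + alpha) / (spam_total + alpha * len(vocab))

def p_word_given_ham(w):
    return (tf_ham.get(w, 0) + alpha) / (ham_total + alpha * len(vocab))
```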

6. Classification

To classify a given message, we first preprocess it. For each word w in the processed message we compute P(w|spam) and take the product over all the words; if w does not exist in the training dataset, we take TF(w) as 0 and find P(w|spam) using the smoothed formula above. We multiply this product with P(spam), and the resultant product is P(spam|message). Similarly, we find P(ham|message). Whichever of these two probabilities is greater, the corresponding tag (spam or ham) is assigned to the input message.
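A classification sketch along those lines, using the smoothed Bag of Words probabilities above (it multiplies the raw probabilities as described in the text; in practice summing log probabilities avoids numerical underflow on long messages):

```python
def classify(message):
    p_spam_given_msg = p_spam          # start from the class priors
    p_ham_given_msg = 1 - p_spam
    for w in preprocess(message):
        p_spam_given_msg *= p_word_given_spam(w)
        p_ham_given_msg *= p_word_given_ham(w)
    return 'spam' if p_spam_given_msg > p_ham_given_msg else 'ham'

print(classify("WINNER!! You have been selected for a FREE prize, call now to claim"))
print(classify("Are we still meeting for lunch tomorrow?"))
```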