The Ultimate Guide To SMS: Spam or Ham Detector

Original article was published by Mala Deep on Artificial Intelligence on Medium


TL;DR: Understand the spam or ham classifier in terms of Artificial Intelligence concepts, work with various classification algorithms, select the algorithm that gives the highest accuracy, and develop a Python Flask app.

This post is part of a series. If you haven’t read about the theoretical Artificial Intelligence concepts behind the spam or ham classifier, or have not worked with the algorithms in a Jupyter notebook, please explore it at:

We have covered in Parts 1 & 2:

  • Theoretical AI Concept Regarding Spam or Ham Classifier
  • Classification Algorithms
  • Exploring Data Source
  • Data Preparation
  • Exploratory Data Analysis

We will cover here in Part 3

  • Naïve Bayes Behind Spam or Ham
  • Performance Measurement Criterion
  • Development of Spam or Ham Detector
Designed by Author. Illustration from unDraw.

Naïve Bayes Behind Spam or Ham

One of the most useful applications of the Bayes rule is the so-called naive Bayes classifier.

The Naïve Bayes algorithm creates a probabilistic model for classifying SMS messages. Even though all features contribute to the overall probability of a classification, the Naïve Bayes algorithm assumes that the features are statistically independent of each other[10]. Although this assumption may not hold true in all cases, Naïve Bayes has shown promising results in comparison with other well-known classification algorithms. An advantage of Naïve Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification, and on small datasets Naïve Bayes classifiers can outperform more powerful alternatives[18]. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix.

Bayes Rule. Image by Nils Jacob Sand
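
For reference, the standard form of Bayes' rule can be written as below. The symbols A (a hypothesis) and B (the observed evidence) are generic placeholders chosen here, not necessarily the notation used in the original figure.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```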

Thus, for classifying an SMS as spam or ham, the probability that the project wants to compute is:

Image by Sebastian Raschka.

Where x is a feature vector containing the words coming from the Spam (or Ham) SMS text[11].
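
Written out under the usual naive Bayes formulation, the posterior and its independence factorization look roughly as follows. The word symbols x_1, ..., x_n are notation introduced here for illustration.

```latex
P(\text{spam} \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \text{spam})\, P(\text{spam})}{P(\mathbf{x})},
\qquad
P(\mathbf{x} \mid \text{spam}) = \prod_{i=1}^{n} P(x_i \mid \text{spam})
```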

P(spam) is the prior: “the probability that any new message is a spam message”.

For ham, the prior is P(ham) = 1 − P(spam).
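
To make the prior and posterior concrete, here is a minimal from-scratch sketch in Python. It is not the project's code: the toy messages, the function names `train` and `posterior_spam`, and the use of add-one (Laplace) smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training data, invented for illustration only.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free entry claim your prize", "spam"),
    ("are we meeting for lunch today", "ham"),
    ("call me when you get home", "ham"),
    ("see you at lunch", "ham"),
]


def train(messages):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter()
    word_counts = {"spam": Counter(), "ham": Counter()}
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return class_counts, word_counts, vocab


def posterior_spam(text, class_counts, word_counts, vocab):
    """Return P(spam | words), assuming the words are independent given the class."""
    total = sum(class_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        # Prior: fraction of training messages carrying this label.
        score = math.log(class_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one smoothing (an assumption of this sketch) avoids zero likelihoods.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        log_scores[label] = score
    # Normalise the two joint scores into a posterior probability.
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp_scores["spam"] / (exp_scores["spam"] + exp_scores["ham"])


model = train(TRAIN)
prior_spam = model[0]["spam"] / sum(model[0].values())
p_spam = posterior_spam("free prize", *model)
p_ham_msg = posterior_spam("lunch today", *model)
```

On this toy data, a message made of words seen mostly in spam ends up with a posterior above 0.5, and a message made of ham words ends up below it. Working in log space simply avoids numerical underflow when many small probabilities are multiplied.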

Performance Measurement Criterion

Accuracy, the percentage of messages correctly classified, is a common measure for evaluating the performance of the filter. It has, however, been highlighted that using accuracy as the only performance index is not sufficient. Other performance metrics such as recall, precision, Area Under the ROC Curve (AUC) and derived measures used in the field of information retrieval must be considered, as must the false positives and false negatives used in decision theory[12]. ROC curves and their relatives are very useful for exploring the tradeoffs among different classifiers over a range of costs. Roughly speaking, a larger area under the curve indicates better performance. To determine the other three criteria, the project should first define some terms:

True positive (TP): The number of spam SMS messages that have been correctly classified as spam.

False positive (FP): The number of legitimate SMS messages that have been incorrectly classified as spam.

True negative (TN): The number of legitimate SMS messages that have been correctly classified as legitimate.

False negative (FN): The number of spam SMS messages that have been incorrectly classified as legitimate.

A false-positive error, which diverts a legitimate SMS to spam, is generally considered more serious than a false negative.
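
As a rough sketch of how these four counts can be obtained, the snippet below tallies them from two label lists, taking spam as the positive class (consistent with the false-positive description above). The function name `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Tally TP, FP, TN and FN, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # spam correctly flagged as spam
        elif p == positive:
            fp += 1  # legitimate message wrongly flagged as spam
        elif a == positive:
            fn += 1  # spam that slipped through as legitimate
        else:
            tn += 1  # legitimate message correctly let through
    return tp, fp, tn, fn


# Hypothetical labels, invented for illustration only.
actual = ["spam", "ham", "ham", "spam", "ham", "spam"]
predicted = ["spam", "ham", "spam", "ham", "ham", "spam"]

tp, fp, tn, fn = confusion_counts(actual, predicted)
```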

Now the performance measurement criteria can be defined as:

Performance criteria for the model. Image by Author.

Recall determines the proportion of spam messages that have been correctly categorized, precision determines the proportion of messages categorized as spam that really are spam, and accuracy determines the proportion of all messages that have been categorized correctly.
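
These three criteria can be sketched in a few lines of Python, treating spam as the positive class (an assumption of this sketch). The counts below are made up for illustration and are not the project's results.

```python
def precision(tp, fp):
    """Share of messages flagged as spam that really are spam."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual spam messages that were flagged as spam."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, tn, fn):
    """Share of all messages that were classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts, invented for illustration only.
tp, fp, tn, fn = 90, 10, 880, 20

p = precision(tp, fp)          # 90 / 100
r = recall(tp, fn)             # 90 / 110
acc = accuracy(tp, fp, tn, fn) # 970 / 1000
```

With these made-up counts accuracy is 0.97 while recall is only about 0.82, which illustrates why accuracy alone can hide missed spam when ham dominates the data.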

From the project (see the code), the confusion matrix obtained with Naïve Bayes is:

Confusion matrix values. Image by Author.

This shows that the project has a symmetric dataset, where the numbers of false positives (14) and false negatives (15) are almost the same. Therefore, the project does not need to look at other parameters to evaluate the performance of the model[19]. However, for study purposes, the project calculated the following metrics.

Performance metrics of the model. Image by Author.

Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all the texts labeled as spam, how many actually are spam? High precision corresponds to a low false positive rate. The model has a precision of 0.99, which is pretty good.

Precision Formula. Image by Author.

Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. The question recall answers is: of all the texts that truly are spam, how many did the model label as spam? The model has a recall of 0.93, which is good for this model, as it’s above 0.5.

Recall Formula. Image by Author.

Starting with Artificial Intelligence concepts like agents, environments and PEAS, then exploring seven binary classification algorithms, performing vectorization, TF-IDF and EDA, implementing the algorithms, and evaluating their performance with the confusion matrix, precision and recall, we created a web-based SMS: Spam or Ham Detector.

Conclusion

The Naïve Bayes classification algorithm is effective for classifying categorical data. The fundamental theory it uses is the Bayes conditional probabilistic model for finding a posterior probability given certain conditions. It is called “Naïve” because it assumes that all features (collections of words) in the dataset are equally important and independent. Using the Naïve Bayes classification algorithm, the project got more than 98% accuracy in predicting a spam message based on the words it contains. To make the predictions more accurate, the project needs to increase the amount of data in the dataset.

This concludes our work. If you want to implement the code yourself, please visit my GitHub, clone the repository and play with it.

Interact with the final product

If you have any queries regarding the article or want to work together on your next data science project, ping me on LinkedIn.