Introduction to N-grams,Language Modeling,Text-classification & Naive Bayes classification in…

Original article was published on Artificial Intelligence on Medium

Introduction to N-grams,Language Modeling,Text-classification & Naive Bayes classification in Natural Language Processing.

Language Modeling

Language Modeling is one of the most important topic in NLP. The goal of Language Modeling is to assign a probability to a sentence.( Why ?)

Let’s say for Machine translation of a good and a bad sentence

P(high winds tonight)>P(large winds tonight)

Or in Spelling correction

P(about fifteen minutes from)>P(about fifteen minuets from)

or in Speech Recognition

P(I saw a van)>>P(eyes awe of an)

so to compute a probability of a sentence or sequence of words is P(W)=P(W1,W2,W3….Wn) or probability of an upcoming word will be P(W5|W1,W2,W3,W4), a model that compute either of these is called a Language Model. So how we compute P(W) or how Language model actually works, is to rely on an intuition of chain rule of probability i.e P(A|B)=P(A,B)/P(B) or P(A,B)=P(A|B) P(B) , the Chain Rule in general is P(X1,X2,X3….Xn)=P(X1)P(X2|X1)P(X3|X1,X2)….P(Xn|X1…Xn-1) , but computing probability for a large dataset will be very time consuming , so to reduce that we use Markov’s Assumption i.e the probability of a sequence of word is the product of conditional probability of that word given some prefix of the last few words i.e P(W1,W2,W3….Wn)=P(Wi|Wi-k…..Wi-1) for example let’s say We have a phase “He ran so fast” then

P(that|He ran so fast)~P(that|fast)

The simplest case of the Markov’s Model is the uni-gram model where we have set of individual words(fifth,of,an, a….etc) and for each word we compute Probability, a slightly more intelligent is a bi-gram model(outside,new ,car, parking lot….etc) similarly there is tri-gram or we can extend our N-gram model to make it more intelligent .

Estimating N-gram Probability

So to find the probability we have P(Wi|PWi-1)=count (Wi-1,Wi)/count(Wi) where Wi is the desired word and Wi-1 is the previous word of Wi and count(Wi-1,Wi) is how many time Wi and Wi-1 occurs together. Let’s take an example to simplify this, Suppose we have a document

<S>I am bforBlack</S>

<S>bforBlack I am</S>

<S>bforBlack is the ted who codes</S>

P(<S>|I)=1/3 ,P(<S>|bforBlack)=2/3 ,P(bforBlack|I)=1/3 and so on, the best way to visualize will be a metric which tells which word has more number of counts after which word.

There are number of publicly available Language Modeling tool kits for example SRILM or the Google -N gram Release.

The goal of our Language Model must be to assign higher probability to real or frequently observed sentences.We need to train robust model that do a better job of generalizing.One kind of generalization is dealing with Zeros, by Zeros I mean things that never occurs in our training set but do occur in our test set, to deal with it we use add-one smoothing or Laplace smoothing.

Laplace Smoothing or Add-one Smoothing

Let’s say our Language model encounters some uni-grams or bi-grams which were never recognized by our language model and yes is a valid statement , working around with our previous approach unfortunately our prediction for that will be zero and cancelling out everything, so to overcome that we use Laplace or Add-one smoothing i.e sharing out a little amount of probability from others and cancelling out zeros in every case and hence modifying our formula as


where V is the vocabulary size.

Text Classification

One of the text classification task is gender identification i.e determining whether a given author is a male or a female. A research in gender identification shows that if we look into number of pronouns , determiners, noun phrases and other features , helps to determine the writer’s gender. Female writer’s tends to use more pronouns as male writer’s tends to use more facts and determiners or let’s say after seeing a movie review or a product I can categories it as positive or negative review for example if a review is “Unbelievable disappointed” we call it a negative review or “The greatest comedy movie I have ever seen” we call it a positive review.

Text Classification also helps in subject category hierarchy which you can say automatically indexing an artistically according to its category by just going through it’s headline or feature words. Some of the other examples of Text classification are Spam Detection,Authorship Identification,Language Identification, Sentiment Analysis.

How do we do Text Classification

Let’s say we get a document as d and we have a set of fixed classes c={c1,c2,c3…cn} and our job is to take a document and assign a class to that document , but how do we do that ?

A simple way is to have hand written rules for example if we are having spam detection then we might have a list of black listed emails or we might look for phrases like special characters or words like “millions of dollars or rupees” or “You have Been Selected”, if these rules are defined by an expert you can get a high accuracy but building and maintaining these rules are expensive.Spam Classification comes Supervised Machine Learning.

Naive Bayes Classification Algorithm

Naive Bayes is one of the most important algorithm for Text Classification. The intuition on Naive Bayes is based on Bayes rule and relies on a very simple way of document representation called the bag of words. Let’s see the intuition of bag of words representation.

Imagine I have a document(Movie Review of The Irish-Man) which says

“I loved this movie. It’s long but I think it needed to be to get you drawn into the long period of time the story is set over and also to capture the mood of the different eras. A bit slow but again I think the pace fitted the story and the pace was even throughout which is always a good thing. Also the editing kept the story unfolding in an interesting way. ”

Our Job is to build a function which take this document and returns a class(Positive or Negative).In order to solve this task one thing we might do is to look at individual words in this document like “loved,mood,good ,interesting” or look at all the words or look at some subsets.The bag looses all the information about the order of the words and focus on set of words that occurs and their count or we can say a vector of set of words with their count, so the idea of bag of words is to represent the document with the list of words and their count.

Formalizing The Naive Bayes Classification Algorithm

For document d and class c and our goal is to compute the probability of each class and its condition probability of a given document i.e P(c|d) and we gonna use this probability to pick the best class.

P(c|d)=(P(d|c)P(c))/P(d) or P(d|c)P(c) or P(x1,x2,x3…xn|c)P(c)

so if P(c1|d1)>P(c2|d1) we choose class c1 for document d1.

I hope till now you might be having a brief about N-grams,Language modeling,Text-classification and Naive Bayes Classification algorithm.