Mathematical Introduction To N-Gram Language Model

Source: Deep Learning on Medium

NLP is a wide field. Natural Language Processing is the ability of a computer to learn human language so that text can be processed at a scale beyond human capability. We can train models to predict the next words of a statement or to summarize large blocks of text. It is an interesting field for researchers who want to understand language better or make automatic translation and autocomplete more efficient.

Language Model

Models that assign probabilities to sequences of words are called language models.

To put it in simple terms:

Suppose I have a partial sentence like

I eat ….

Now there are several candidate words to continue it, like pizza, burger, or snake.

Every candidate word has some probability of following the previous words. All three fit grammatically after the verb, but snake is an unusual choice; even so, it still has some probability given our partial sentence. So the probability we care about is

P(burger | I eat) or P(pizza | I eat)

This probability would be higher than that of snake or any other word. Basically, a language model predicts the next word as the one with the highest probability. Models usually apply an argmax function to pick the highest value among the probabilities.
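The argmax step can be sketched in a few lines of Python. The probabilities below are hypothetical values chosen for illustration, not estimates from a real corpus, and `predict_next` is just an illustrative helper name.

```python
# Hypothetical conditional probabilities P(word | "I eat").
# The numbers are made up for illustration, not estimated from data.
next_word_probs = {"pizza": 0.5, "burger": 0.4, "snake": 0.1}


def predict_next(probs):
    """Return the word with the highest probability (argmax over the dict)."""
    return max(probs, key=probs.get)


prediction = predict_next(next_word_probs)
print(prediction)  # pizza
```

Snake is still a valid candidate here; it simply loses to the words with higher probability.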

N-gram Models

The term N-gram comes up often in NLP models. An N-gram is a sequence of N consecutive words, for example:

Navi Mumbai (2-gram)

She ate Hamburger (3-gram)

He shoots a duck (4-gram)
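As a rough sketch, the N-grams inside a longer sentence can be listed by sliding a window of N words over it. The helper name `ngrams` and the whitespace tokenization are assumptions made for this example.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples."""
    words = text.split()  # naive whitespace tokenization (an assumption)
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("He shoots a duck", 2))
# [('He', 'shoots'), ('shoots', 'a'), ('a', 'duck')]
print(ngrams("He shoots a duck", 4))
# [('He', 'shoots', 'a', 'duck')]
```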

Let's look at the mathematics.

Consider a sentence made of the words w1, w2, …, wn,

where n is the number of words in the sentence.

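We start from the chain rule of probability.

Applying the same rule to our words factors the probability of the sentence into one conditional probability per word, each conditioned on all the words before it.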
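Written out, the chain rule for two events and its application to the sentence look roughly like this. The original formulas are not shown, so this is the standard form of the rule in the w1, …, wn notation used above.

```latex
P(A, B) = P(A)\, P(B \mid A)

P(w_1, w_2, \ldots, w_n)
  = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})
  = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```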

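Still, the last terms become too complicated to compute in a large model, because each one conditions on the entire preceding history.

We have a solution for that, called the Markov assumption.

The Markov assumption is the assumption that the probability of a word depends only on the previous word (or, more generally, on a fixed number of previous words).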
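For the case of a single previous word (a bigram model), the assumption can be written as below. This is the standard textbook form, given here as a sketch; for i = 1 there is no previous word, so the first factor is simply P(w_1).

```latex
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})

P(w_1, w_2, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```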

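The problem with the above method is that the probabilities are not properly normalized: the first word has no previous word to condition on, and without an explicit end of sentence the probabilities of all possible sentences do not sum to one.

The solution for this is simple: add fake start and end tokens.

As a simple example, with fake start and end tokens added the sentence "I eat pizza" becomes "<s> I eat pizza </s>".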
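A minimal sketch of the padding step and the resulting bigram factors follows. The token names `<s>` and `</s>` and the helper names are assumptions for illustration; any symbols that cannot occur as real words would do.

```python
START, END = "<s>", "</s>"  # assumed names for the fake boundary tokens


def pad(sentence):
    """Wrap a whitespace-tokenized sentence in start and end tokens."""
    return [START] + sentence.split() + [END]


def bigram_factors(sentence):
    """List the conditional probabilities a bigram model multiplies together."""
    tokens = pad(sentence)
    return [f"P({cur} | {prev})" for prev, cur in zip(tokens, tokens[1:])]


print(pad("I eat pizza"))
# ['<s>', 'I', 'eat', 'pizza', '</s>']
print(" * ".join(bigram_factors("I eat pizza")))
# P(I | <s>) * P(eat | I) * P(pizza | eat) * P(</s> | pizza)
```

With the padding in place, every word, including the first one, has a previous token to condition on, and the model also has to predict where the sentence ends.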

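How do we train a language model?

  1. Log-likelihood model

We estimate the probability of word y following word x by counting how often the pair occurs in the training text, c(xy), and then normalizing by the count of everything that follows x.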
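The counting and normalizing described above can be sketched as follows. The toy corpus is made up, and the `<s>` and `</s>` boundary tokens and the function name `train_bigram` are assumptions of this example rather than part of any particular library.

```python
from collections import Counter


def train_bigram(sentences):
    """Estimate P(y | x) = c(x y) / c(x) from whitespace-tokenized sentences."""
    pair_counts = Counter()     # c(x y): how often y directly follows x
    context_counts = Counter()  # c(x): how often x is followed by any word
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]  # assumed boundary tokens
        for x, y in zip(tokens, tokens[1:]):
            pair_counts[(x, y)] += 1
            context_counts[x] += 1
    # Normalize each pair count by the count of its first word.
    return {(x, y): c / context_counts[x] for (x, y), c in pair_counts.items()}


corpus = ["I eat pizza", "I eat burger", "I eat pizza"]  # made-up toy corpus
probs = train_bigram(corpus)
print(probs[("eat", "pizza")])  # 0.6666666666666666
print(probs[("I", "eat")])      # 1.0
```

In this toy corpus "eat" is followed by "pizza" in two of three sentences, which is where the estimate of 2/3 comes from.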

Simplified version of the formula
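In the c(xy) notation used above, the estimate and its simplification can be written as below. This is the standard maximum-likelihood form, reconstructed as a sketch because the original formula is not shown; the simplification holds because summing c(xw) over every word w that follows x gives the number of times x appears as a context, c(x).

```latex
P(y \mid x) = \frac{c(x\,y)}{\sum_{w} c(x\,w)} = \frac{c(x\,y)}{c(x)}
```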


We don't use the raw probability directly when training and evaluating models. We use a specific variant of it, called perplexity. The most important thing to remember is that lower perplexity is better.

Calculating perplexity
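The usual definition of perplexity, given here as a sketch since the original formula is not shown, is the inverse probability of the test text normalized by its length N. The second form assumes the bigram model from above and is the one normally computed in practice, because summing logs avoids numerical underflow.

```latex
\mathrm{PP}(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}}
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-1}) \right)
```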

One more problem remains: if an out-of-vocabulary word occurs, the probability of that word is 0 and the perplexity becomes infinite. 🙂
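The effect is easy to reproduce with a small sketch. The bigram probabilities below are hypothetical values made up for illustration, and `perplexity` is an illustrative helper that assumes the token list is already padded and has at least two tokens.

```python
import math


def perplexity(tokens, bigram_probs):
    """Perplexity of a padded token list under a bigram model stored as a dict."""
    n = len(tokens) - 1  # number of predicted tokens (assumes len(tokens) >= 2)
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = bigram_probs.get((prev, cur), 0.0)  # unseen pairs get probability 0
        if p == 0.0:
            return math.inf  # one zero factor makes the whole product zero
        log_prob += math.log(p)
    return math.exp(-log_prob / n)


# Hypothetical bigram probabilities, made up for illustration.
bigram_probs = {
    ("<s>", "I"): 1.0,
    ("I", "eat"): 1.0,
    ("eat", "pizza"): 0.5,
    ("eat", "burger"): 0.5,
    ("pizza", "</s>"): 1.0,
    ("burger", "</s>"): 1.0,
}

seen = ["<s>", "I", "eat", "pizza", "</s>"]
unseen = ["<s>", "I", "eat", "snake", "</s>"]
print(perplexity(seen, bigram_probs))    # about 1.19
print(perplexity(unseen, bigram_probs))  # inf
```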


  1. That is where the <unk> token comes in handy.
  2. Build the vocabulary beforehand.
  3. Important: smoothing.
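The first two points can be sketched together: fix the vocabulary from the training text, then map every word outside it to `<unk>`. The helper names and the `min_count` threshold are assumptions made for this example.

```python
from collections import Counter

UNK = "<unk>"


def build_vocab(sentences, min_count=1):
    """Keep every word that appears at least min_count times in training."""
    counts = Counter(word for s in sentences for word in s.split())
    return {w for w, c in counts.items() if c >= min_count}


def replace_oov(sentence, vocab):
    """Replace each out-of-vocabulary word with the <unk> token."""
    return [w if w in vocab else UNK for w in sentence.split()]


train = ["I eat pizza", "I eat burger"]  # made-up toy training text
vocab = build_vocab(train)
print(replace_oov("I eat snake", vocab))  # ['I', 'eat', '<unk>']
```

For `<unk>` to receive a nonzero probability, it also has to appear in the training text. One common way to arrange that is to raise `min_count` and apply the same replacement to the training sentences before counting.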

I will cover smoothing in the next blog post.
