Source: Deep Learning on Medium
Mathematical Introduction To N-Gram Language Model
Natural Language Processing (NLP) is a wide field. It is the study of how computers can learn about human language so that they can process it at a scale beyond human capability. We can train models to predict the next words of a statement or to summarize large blocks of text. It is an interesting field for researchers who want to understand language better, or who want to make machine translation and auto-completion more efficient.
A model that assigns probabilities to a sequence of words is called a 'language model'.
To put it in simple language:
If I have a partial sentence like
I eat ….
Now there are word choices like pizza, burger, or snake.
Every word has some probability given the previous words. Pizza and burger match naturally with the verb "eat"; snake is an odd choice, but it still has some probability given our partial statement. So the basic probability for the sentence should be

P(burger or pizza | I eat)

This probability would be higher than that of snake or any other word. Basically, a language model predicts the next term as the one with the highest probability. Models usually apply the argmax function to pick the highest value among the probabilities.
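The argmax step above can be sketched in a few lines. The probabilities here are made-up illustrative numbers, not the output of any real model:

```python
# Made-up next-word probabilities for the partial sentence "I eat ..."
candidates = {"pizza": 0.45, "burger": 0.40, "snake": 0.01}

# argmax: pick the candidate word with the highest probability
best = max(candidates, key=candidates.get)
print(best)  # pizza
```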
The term N-gram comes up many times in NLP models. An N-gram is a contiguous sequence of N words, for example:
Navi Mumbai - 2-gram
She ate hamburger - 3-gram
He shoots a duck - 4-gram
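Extracting the N-grams from a sentence is a simple sliding window over its tokens. Here is a minimal sketch (the function name `ngrams` is my own choice):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("she ate a hamburger".split(), 2))
# [('she', 'ate'), ('ate', 'a'), ('a', 'hamburger')]
```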
Let's look at the mathematical formulation.
Consider a sentence made of the words w1, w2, …, wn,
where n is the number of words in the sentence.
Applying the chain rule of probability to our words, we get the final result:

P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, …, wn-1)
Still, the last term becomes too complicated to compute in a large model, because it conditions on the entire history of preceding words.
We have a solution for that, called the 'Markov Assumption'.
The Markov Assumption is the assumption that the probability of a word depends only on the previous word:

P(wi | w1, …, wi-1) ≈ P(wi | wi-1)
A problem with the above method is that the probability weights are not properly normalized at the sentence boundaries.
The solution is simple: add fake start and end tokens, <s> and </s>, to every sentence.
How do we train a language model?
- Maximum likelihood estimation
To compute the probability of word y following word x, we count c(xy), the number of times y comes after x, and normalize it: P(y | x) = c(xy) / c(x).
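The count-and-normalize step can be sketched as a small bigram trainer. This is a minimal illustration on toy data, including the fake start and end tokens mentioned above:

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(y | x) = c(xy) / c(x) from tokenized sentences,
    using <s> and </s> as the fake start/end tokens."""
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigram.update(tokens[:-1])             # denominators c(x)
        bigram.update(zip(tokens, tokens[1:]))  # numerators  c(xy)
    return {(x, y): c / unigram[x] for (x, y), c in bigram.items()}

probs = train_bigram([["i", "eat", "pizza"], ["i", "eat", "burger"]])
print(probs[("eat", "pizza")])  # 0.5
```

Here "eat" appears twice in the training data and is followed by "pizza" once, so P(pizza | eat) = 1/2.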
We can't use the raw probability while evaluating models; we use a specific variant of it called perplexity. The most important thing to remember is that lower perplexity is always better.
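One common way to write perplexity is as the exponential of the negative average log-probability per word. A minimal sketch, assuming the model has already assigned a probability to each word in the test text:

```python
import math

def perplexity(probabilities):
    """exp(-(1/N) * sum(log p_i)) over per-word probabilities; lower is better."""
    n = len(probabilities)
    return math.exp(-sum(math.log(p) for p in probabilities) / n)

# A model that is uniformly unsure between 4 choices has perplexity ~4
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```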
Again, there is one more problem: what if an out-of-vocabulary word occurs? Its probability is then 0, and the perplexity becomes infinite. 🙂
- That is where the <unk> token comes in handy.
- Build the vocabulary beforehand.
- Most importantly: smoothing.
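The first two points can be sketched together: build the vocabulary from the training data beforehand, then map anything outside it to <unk>. The function names and the `min_count` cutoff here are my own illustrative choices:

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep words seen at least min_count times; add the <unk> token."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count} | {"<unk>"}

def apply_unk(sentence, vocab):
    """Replace any out-of-vocabulary word with <unk>."""
    return [w if w in vocab else "<unk>" for w in sentence]

vocab = build_vocab([["i", "eat", "pizza"], ["i", "eat", "burger"]])
print(apply_unk(["i", "eat", "snake"], vocab))  # ['i', 'eat', '<unk>']
```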
I will cover smoothing in the next blog.
Feel free to reach out to me on email@example.com