The basics of Language Modeling

Source: Deep Learning on Medium

N-Gram language models

Intuitively, common words like “cat” or “dog” should tend to have higher probabilities than rarer ones such as “aardvark” or “kingfisher”. A good starting point, then, is the frequency of words in the corpus. A model that considers only the number of occurrences of a word, normalized by the total number of words in the corpus, is called a uni-gram language model. Similarly, bi-gram language models consider the frequency of pairs of words. For example, if in our English corpus the pair [united, states] appears more often than [united, the], a bi-gram language model will assign a higher probability to “states” than to “the” as the word following “united”, despite the much higher overall frequency of the latter.

Higher-order n-gram language models also exist, but as the length of the word sequences increases, their frequency in the corpus decreases exponentially. These models therefore suffer from a sparsity problem and struggle with infrequent word sequences.
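To make this concrete, here is a minimal sketch of both models in Python on a toy corpus. The corpus, variable names, and the `p_bigram` helper are illustrative assumptions for this post, not a production implementation.

```python
from collections import Counter

# Toy corpus; a real model would be estimated from a much larger text collection.
corpus = ("the united states of america and the united kingdom "
          "signed the treaty in the united states").split()

# Uni-gram model: P(w) = count(w) / total number of tokens.
unigram_counts = Counter(corpus)
total = len(corpus)
p_unigram = {w: c / total for w, c in unigram_counts.items()}

# Bi-gram model: P(w2 | w1) = count(w1, w2) / count(w1).
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_unigram["the"])              # high: "the" is frequent overall
print(p_bigram("united", "states"))  # high: "states" usually follows "united"
print(p_bigram("united", "the"))     # zero, despite "the" being frequent overall
```

Note how the last line already hints at the sparsity problem: any pair of words never seen in the corpus gets probability zero, no matter how plausible it is.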