Introduction to Language Modelling and Deep Neural Network Based Text Generation

Source: Deep Learning on Medium



NLP involves a number of important tasks such as text classification, sentiment analysis, machine translation and text summarization. Another core task of NLP is language modelling, which underlies generating text conditioned on some input information. Before the recent advances in deep neural network models, the most commonly used methods for text generation were either template or rule-based systems, or probabilistic language models such as n-gram or log-linear models [Chen and Goodman, 1996, Koehn et al., 2003].

Language modelling is the task of predicting which word comes next; more generally, a language model is a system that assigns a probability to a piece of text. The n-gram model is the simplest language model, and its performance is limited by its lack of complexity. Simplistic models like this one cannot achieve fluency, enough language variation and a correct writing style for long texts. For these reasons, neural networks (NN) have been explored as the new standard despite their complexity, and Recurrent Neural Networks (RNN) became a fundamental architecture for sequences of any kind.

The RNN is nowadays often considered the default architecture for text, but RNNs have problems of their own: they cannot remember past content for long, and they struggle to create long, relevant text sequences because of exploding or vanishing gradients. For these reasons, other architectures such as Long Short Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997; Alex Graves, 2014] and Gated Recurrent Units (GRU) [Kyunghyun Cho et al., 2014] were developed and became the state-of-the-art solution for many language generation tasks. In this post, we will be using an LSTM to generate sequences of text.

Language Model

Models that assign probabilities to sequences of words are called language models. There are primarily two types of Language Models:

1) Statistical Language Models: These models use traditional statistical techniques like N-grams, Hidden Markov Models (HMM) and certain linguistic rules to learn the probability distribution of words.

2) Neural Language Models: They use different kinds of Neural Networks to model language and have surpassed the statistical language models in their effectiveness.

N-Gram Models

We have described language models as calculating the probability of the next word given a sequence of words. Let’s begin with the task of computing P(w|h), the probability of a word w given some history h. Suppose the history h is “its water is so transparent that” and we want to know the probability that the next word is “the”:

P(the|its water is so transparent that).

Instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words. Below is the mathematical representation of different n-gram models:
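In standard notation, the approximations can be written as follows. This is a reconstruction under the usual Markov assumption; the symbols w_1, ..., w_n and N are assumed here and may differ from the original figure.

```latex
\begin{align*}
\text{unigram:} \quad & P(w_n \mid w_1, \dots, w_{n-1}) \approx P(w_n) \\
\text{bigram:}  \quad & P(w_n \mid w_1, \dots, w_{n-1}) \approx P(w_n \mid w_{n-1}) \\
\text{trigram:} \quad & P(w_n \mid w_1, \dots, w_{n-1}) \approx P(w_n \mid w_{n-2}, w_{n-1}) \\
\text{N-gram:}  \quad & P(w_n \mid w_1, \dots, w_{n-1}) \approx P(w_n \mid w_{n-N+1}, \dots, w_{n-1})
\end{align*}
```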

For example, the bigram conditional probability can be calculated as:
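Under the usual maximum-likelihood estimate, the bigram probability is a ratio of counts: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}), where C(.) counts occurrences in the training corpus. The notation C(.) and the toy corpus below are illustrative assumptions, not taken from the original. A minimal Python sketch of the same estimate:

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word) from a token list."""
    # Count every adjacent pair (w_{n-1}, w_n) in the corpus.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word as a context, i.e. only where another token follows it.
    context_counts = Counter(tokens[:-1])
    if context_counts[prev_word] == 0:
        return 0.0  # context never seen: no estimate available, return zero
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]

# Hypothetical toy corpus, used only for illustration.
toy_corpus = "its water is so transparent that the water is so clear".split()
p_seen = bigram_probability(toy_corpus, "water", "is")       # pair occurs in the corpus
p_unseen = bigram_probability(toy_corpus, "water", "clear")  # pair never occurs
```

Note that the unseen pair receives probability zero, however plausible it might be in real text.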

Limitations of N-gram models:

N-gram based language models do have a few drawbacks:

• The n-gram model is the simplest language model and its performance is limited by its lack of complexity. Simplistic models like this one cannot achieve fluency, enough language variation and a correct writing style for long texts.

• The higher the N, the better the model usually is. But a higher N also means many more n-grams to count and store, which leads to a large computational overhead, especially in terms of RAM.

• N-grams are a sparse representation of language. This is because we build the model based on the probability of words co-occurring. Without smoothing, it will give zero probability to every word sequence that is not present in the training corpus.

Due to these drawbacks, we will build our character-based text generation model on a neural network architecture.

RNNs (Recurrent Neural Networks):

A major characteristic of most neural networks, such as densely connected networks and convnets, is that they have no memory. Each input shown to them is processed independently, with no state kept in between inputs. With such networks, in order to process a sequence or a temporal series of data points, you have to show the entire sequence to the network at once: turn it into a single data point.

A recurrent neural network (RNN) processes sequences by iterating through the sequence elements and maintaining a state containing information relative to what it has seen so far. In effect, an RNN is a type of neural network that has an internal loop.

Simple RNN architecture

Types of RNN

Image from Andrej Karpathy Blog “Unreasonable Effectiveness of Recurrent Neural Networks”

There are five different types of RNN models:

(1) Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification).

(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).

(3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment).

(4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French).

(5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).

Pseudocode for RNN
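The loop described above can be sketched in NumPy as follows. This is a minimal illustration, assuming a tanh activation and randomly initialized parameters; the names W, U, b and the chosen sizes are hypothetical and only serve the example.

```python
import numpy as np

def simple_rnn_forward(inputs, W, U, b):
    """Run a simple RNN over `inputs` of shape (timesteps, input_features).

    Returns the outputs for all timesteps, shape (timesteps, output_features).
    """
    state_t = np.zeros(U.shape[0])  # initial state: all zeros
    outputs = []
    for input_t in inputs:  # iterate over the sequence elements
        # Combine the current input with the state carried over from the past.
        output_t = np.tanh(W @ input_t + U @ state_t + b)
        outputs.append(output_t)
        state_t = output_t  # the output becomes the state for the next step
    return np.stack(outputs)

# Illustrative sizes and random parameters (assumptions for this sketch).
rng = np.random.default_rng(0)
timesteps, input_features, output_features = 100, 32, 64
inputs = rng.random((timesteps, input_features))
W = rng.random((output_features, input_features))
U = rng.random((output_features, output_features))
b = rng.random(output_features)
outputs = simple_rnn_forward(inputs, W, U, b)
```

The only thing that distinguishes this from a feed-forward layer is that `state_t` is reused from one iteration to the next.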

Simple RNN unrolled over time

Disadvantages of RNNs:

The major disadvantages of RNNs are:

• Vanishing gradients (this problem can be mitigated by using LSTMs or GRUs)

• Exploding gradients (this problem can be mitigated by truncating or squashing the gradients, i.e. gradient clipping)
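As a rough sketch of what "truncating or squashing" the gradients can look like, the function below rescales a list of gradient arrays so that their combined L2 norm never exceeds a threshold. The function name and the threshold value are assumptions made for illustration; deep learning libraries provide their own clipping options.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1.0 when the norm is already small enough, otherwise < 1.0.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

# Hypothetical gradients with a combined norm of 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped = clip_by_global_norm(grads, max_norm=1.0)
```

Clipping keeps the direction of the update and only limits its size, so a single exploding step cannot derail training.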

Short Term Dependencies:

Predict the last word in “the clouds are in the sky,”

Image from Christopher Olah Blog “Understanding LSTM Networks”

When the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

Long Term Dependencies:

Predict the last word in the text “I grew up in France… I speak fluent French.”

Image from Christopher Olah Blog “Understanding LSTM Networks”

The gap between the relevant information and the point where it is needed can become very large. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

So, in summary, RNNs are not able to keep track of long-term dependencies and cannot process very long sequences well. That’s why we will use an upgraded version of the RNN called the LSTM.


Long Short Term Memory networks — usually just called “LSTMs” — are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work.

They work tremendously well on a large variety of problems, and are now widely used.

The repeating module in an LSTM

An LSTM is made up of three gates:

1) Input Gate: Decides how much new information from the present input is added to the present cell state, scaled by how much we wish to add.

2) Forget Gate: Given the previous output h(t-1) and the present input, the forget gate decides what must be removed from the previous cell state, keeping only the relevant information.

3) Output Gate: Finally, decides what to output from the cell state; a sigmoid function selects which parts of the (tanh-squashed) cell state become the new output h(t).

And here are the formulas used for an LSTM network:
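The commonly used formulation is given below, in the notation of Christopher Olah's post; this is a reconstruction, and the original figure may use different symbols. Here x_t is the input, h_t the output, C_t the cell state, σ the sigmoid function, [h_{t-1}, x_t] the concatenation of the previous output and the current input, and ⊙ element-wise multiplication.

```latex
\begin{align*}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) && \text{(candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(new cell state)} \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(new output)}
\end{align*}
```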

Implementing Character Level Text Generation:

The universal way to generate sequence data in deep learning is to train a network to predict the next token or next few tokens in a sequence, using the previous tokens as input.

Image from François Chollet’s book of Deep Learning with Python

When generating text, the way you choose the next character is crucially important. A naive approach is greedy sampling, consisting of always choosing the most likely next character. But such an approach results in repetitive, predictable strings that don’t look like coherent language.

A more interesting approach makes slightly more surprising choices: it introduces randomness in the sampling process, by sampling from the probability distribution for the next character. This is called stochastic sampling.

Sampling probabilistically from the softmax output of the model is neat: it allows even unlikely characters to be sampled some of the time, generating more interesting looking sentences and sometimes showing creativity by coming up with new, realistic sounding words that didn’t occur in the training data.

In order to control the amount of stochasticity in the sampling process, I have used a parameter called the softmax temperature that characterizes the entropy of the probability distribution used for sampling. It characterizes how surprising or predictable the choice of the next character will be. Given a temperature value, a new probability distribution is computed from the original one (the softmax output of the model) by reweighting it in the following way.
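A minimal NumPy sketch of this reweighting, following the approach in Chollet's book, is shown below. The `sample` helper, its optional `rng` argument and the example distribution are assumptions added for illustration, and the code assumes all probabilities are strictly positive, as they are for a softmax output.

```python
import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    """Return a new distribution: lower temperature sharpens, higher flattens."""
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    # Renormalize so that the reweighted values sum to 1 again.
    return distribution / np.sum(distribution)

def sample(preds, temperature=1.0, rng=None):
    """Draw the index of the next character from the reweighted distribution."""
    if rng is None:
        rng = np.random.default_rng()
    probs = reweight_distribution(np.asarray(preds, dtype="float64"), temperature)
    return int(rng.choice(len(probs), p=probs))

# Hypothetical softmax output over three characters.
probs = np.array([0.5, 0.3, 0.2])
cold = reweight_distribution(probs, temperature=0.2)  # close to greedy sampling
hot = reweight_distribution(probs, temperature=2.0)   # closer to uniform
next_index = sample(probs, temperature=0.5, rng=np.random.default_rng(0))
```

A temperature of 1.0 leaves the distribution unchanged; as the temperature approaches 0 the procedure approaches greedy sampling.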

Sampling Strategy:

Image from François Chollet’s book of Deep Learning with Python

Training Data

In this study, I have used two novels (İnce Memed 1 and İnce Memed 2) by Yaşar Kemal, a modern Turkish author, to train an LSTM network.

The language model we’ll learn will be specifically a model of Yaşar Kemal’s writing style and topics of choice, rather than a more generic model of the Turkish language.

Corpus length: 1420227

Number of sequences: 473389

Unique characters: 56

['\n', ' ', '!', '"', "'", ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'x', 'y', 'z', 'â', 'ç', 'ë', 'î', 'ö', 'ü', 'ğ', 'ı', 'ş', '̇', '̈']
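The counts above are consistent with cutting the corpus into overlapping windows of 60 characters, taking a new window every 3 characters, as in the example from Chollet's book: (1420227 - 60) / 3 = 473389. That these were the exact settings used here is an assumption. Under that assumption, the data preparation can be sketched as follows; the function name and the toy text are illustrative.

```python
import numpy as np

def vectorize(text, maxlen=60, step=3):
    """Cut `text` into overlapping windows and one-hot encode them.

    Returns x of shape (sequences, maxlen, unique_chars), the targets y of
    shape (sequences, unique_chars), and the sorted list of characters.
    """
    starts = range(0, len(text) - maxlen, step)
    sentences = [text[i:i + maxlen] for i in starts]
    next_chars = [text[i + maxlen] for i in starts]  # character to predict
    chars = sorted(set(text))
    char_indices = {char: index for index, char in enumerate(chars)}
    x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
    y = np.zeros((len(sentences), len(chars)), dtype=bool)
    for i, sentence in enumerate(sentences):
        for t, char in enumerate(sentence):
            x[i, t, char_indices[char]] = True
        y[i, char_indices[next_chars[i]]] = True
    return x, y, chars

# Hypothetical toy text standing in for the real corpus.
toy_text = "memed bir at gördü. " * 10
x, y, chars = vectorize(toy_text)
```

Each training example is then a window of 60 one-hot encoded characters, and its target is the single character that follows the window.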


I have used the Keras library to train an LSTM network on the training data described above.

Here is the training loss after 23 epochs:

Here is the training accuracy after 23 epochs:

Generated Text After 1st Epoch

--- Generating with seed: "u. halbuki memed onun tam aksi. sevinç içinde. memed de kapı"

------ temperature: 0.2

u. halbuki memed onun tam aksi. sevinç içinde. memed de kapıya da ali safa bey de onun at da bir gelecek bir atları da atın da bir sara bir tarladı. bu da da ali safa bey olursun da bir de bir at başından da atın arak bir beni de bir sesle kara da bir tarla da yanında sara da bir at bir de da bir de beni da bir de da kara saran söylemedi. bu kararak da da bir de da da da başını bir de bir serinden bir de kadar bir içinde bir at sağın bir at başını da da da

------ temperature: 0.5

bir de kadar bir içinde bir at sağın bir at başını da da da size çakırdikenlik bir çuyunların bulutuları benim kim bilir ki sız soradın başını bir sapa baktı. da de karan da kalanlarının üstüne karastır. olmadan kalıyordu. düşünü kara var. onun kadar sustun da yapıların da oturdular kadar bir bularak kalınıyordu. ali safa bey seri da at geliyordu. sesi gibi memed de kimeler bir de köylerinden vardı. da yakıp kapıya da çalına da sonra başını gibi de s

------ temperature: 1.0

n vardı. da yakıp kapıya da çalına da sonra başını gibi de sağsız da attı yukaları seni gibi. artadağı suylarını aldırsın memed atlar oradamiz diye mit daha tartısı ova toprakasın öldürdü. çazar delikleri mi dalartdı. sonra bir korkayıp geçorme, dermiyor, dal vardı. döndü. sarkeni candarman ötlemiyordu. de sesle sarttı, gizsen. bazların battı. geliyor, vurmayı boşu bir iğ tış, esme na da kadar allah sauk olyaradılar, izrisinlediğini ata ya. da onun şah

Generated Text After 20th Epoch

--- Generating with seed: " bacakları üstüne ancak dikilebilen koca osman, atlılar ge"

------ temperature: 0.2

bacakları üstüne ancak dikilebilen koca osman, atlılar geçirdi. memed bir de karanlığa karartı bir karanlık kaldı. sonra bir türlü bir karanlık bir kurşun ali safa bey bu ali safa bey de kalabalık kaldı. kadınların bir karanlık karışının bu sabahları bir de ali safa bey de bir karanlık karanlık duruyordu. bir toprak karanlık bir baban gelirdi. arkasında bir karanlık gibi değildi. bu kadar bir kurşun karanlık bir yanını kaldı. bir anlar kaldı. bir anla

------ temperature: 0.5

kurşun karanlık bir yanını kaldı. bir anlar kaldı. bir anların başını kalmasın başında döndü. bu sevin ana bu ben bu büyük çekiyor. ali safa bey onu sonunu geçmiyordu. bu köylüler çok yaşardı. bana gelir. adam kokusu durduğu kara bir kurşun ağamızı düşündü. ayrıldı. memed bu yanda kırmızı kalmış, insan karanlığın altında senin bir yarasını olmaz. o da bağırarak en yaşanıyordu. o düşünür. karanlık bir konuşuyordu. ne desin işlerinin atların içind

------ temperature: 1.0

. karanlık bir konuşuyordu. ne desin işlerinin atların içinden. ı̇şte: gözündeki verme göne gözlerini banambandı. uyukoysunu turadan kalmış, bu doyucuları hiç gelken “yerlerde çıkardı… dimli kayanın bir diyü geçiyor. birlik seler ne yaptılam da idiyordu. karanlarca durdu. sizili bir kap şeyi gelir tenk içinde insanın altındaki devaşın dinini yüz ağılda… süleyman: a, diye dikte sızanın saza ğeni patı gittiği çizdi.

Result Analysis:

As can be seen from the outputs, a low temperature value results in repetitive and predictable text, but local structure is highly realistic: in particular, all words are real Turkish words.

With higher temperatures, the generated text becomes more interesting, surprising, even creative; it sometimes invents completely new words that sound somewhat plausible (such as banambandı and karanlarca).

By training a bigger model for longer and on more data, you can generate samples that look much more realistic than these. But, of course, you should not expect to generate any meaningful text, other than by random chance: all you’re doing is sampling data from a statistical model of which characters come after which characters.


You can generate discrete sequence data by training a model to predict the next token(s), given previous tokens. In the case of text, such a model is called a language model. It can be based on either words or characters. Sampling the next token requires a balance between adhering to what the model judges likely and introducing randomness. One way to handle this is the notion of softmax temperature. You should experiment with different temperatures to find the right one.


[1] Alex Graves. “Generating Sequences With Recurrent Neural Networks”, 2014.

[2] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio. “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling”, 2014.

[3] Sepp Hochreiter, Jürgen Schmidhuber. “Long Short-Term Memory”, Neural Computation 9(8): 1735–1780, 1997.

[4] Andrej Karpathy Blog “Unreasonable Effectiveness of Recurrent Neural Networks”, May 21 2015.

[5] Christopher Olah Blog “Understanding LSTM Networks”, Aug 27 2015.

[6] François Chollet’s book “Deep Learning with Python” 1st Edition.