Generating Singlish Text Messages with an LSTM Network

Source: Deep Learning on Medium

After publishing my first post on Medium, I realized how enjoyable it was to go through the whole process, from conceptualizing an idea to sharing my findings and learning experience. I realized most data-related projects and solutions arise from these 2 things — 1) A problem we are trying to solve 2) An opportunity from the data that we have. Since my previous post was about solving the problem of manual work in Social Media Contests, I figured I should start by finding the data first this time.

Google Dataset Search

Google recently released Google Dataset Search, a Google Scholar for datasets, and this came in handy in helping me get started. I came across a really interesting dataset from my school, The National University of Singapore SMS Corpus. It is a corpus of more than 50,000 SMS messages in Singapore English (Singlish) and it was part of a research work from the Department of Computer Science. The messages largely originated from volunteers of the study who are Singaporeans attending the University.

An opportunity to understand our language

I thought it was such an amazing opportunity to study the language of those texts, especially because I am a student at NUS myself and I speak and text in Singlish all the time. For the uninitiated, Singlish can seem like a broken version of English, or even a very crude slang. It is actually ubiquitous in Singapore despite its seeming lack of coherence and semantics. In fact, it can often be used to establish a connection and trust with Singaporeans immediately.

You can come from all walks of life and despite your race and mother tongue, as long as you’re Singaporean, you will totally get this.

Singlish can also be incredibly hard to grasp, since a single word can change the entire meaning of a message. That is also one of the reasons why it is such an efficient language.

The nuances of Singlish.

Singlish in text messages is on another level. Besides the lack of complete sentences, texting and Internet shorthand compress it even further.

While I am no linguistics expert, I thought it could be useful to understand this by training a neural network on the corpus to generate similar text messages. Along the way, I want to explain the reasoning behind the final representation chosen for our text generation model.

Scroll down to Code if you don’t want to understand NNs, RNNs and LSTMs.

Feed-Forward Neural Nets

An artificial neural network models our brain, representing neurons with nodes. The network has an input layer which takes in information, hidden layers which process the information (manipulation, computation, feature extraction), and a final output layer which generates a desired output based on that information, usually used to make a prediction.

A simplistic representation of a neural net.

The predicted output and the actual output can be very different and this is measured with a cost function which we want to minimize.

  1. The neural net is just like a baby: he wants to learn how to speak properly (model output Yhat)
  2. by doing a certain set of actions such as uttering, … , shouting random things (X_1 to X_n),
  3. some more frequently than others (W_1 to W_n).
  4. He attempts to say what is right by iteratively trying out different scenarios of actions in his mind (update weights).
  5. The final set of actions he chooses at that point in time is as close as possible to what his parents tell him is right (minimize cost function C).

That is why it is a form of Supervised Learning. This updating/understanding process in his mind is also known as Backpropagation.

The feedforward neural net is the first and simplest type of artificial neural network devised, and it only allows signals to travel from input to output. It has no element of time. When it comes to text generation, we are trying to predict the next word given a sequence of words, so we need a model that represents memory.
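As a minimal sketch of this input-to-output flow (not the model used later in this post), a single forward pass through one hidden layer can be written in a few lines of numpy; all weights here are random placeholders:

```python
import numpy as np

def feed_forward(x, W_hidden, b_hidden, W_out, b_out):
    """One forward pass: input -> hidden (tanh) -> output."""
    h = np.tanh(W_hidden @ x + b_hidden)   # hidden layer processes the input
    y_hat = W_out @ h + b_out              # output layer produces the prediction
    return y_hat

rng = np.random.default_rng(0)
x = rng.normal(size=3)                     # 3 input features (X_1 to X_3)
W_hidden, b_hidden = rng.normal(size=(4, 3)), np.zeros(4)
W_out, b_out = rng.normal(size=(2, 4)), np.zeros(2)

y_hat = feed_forward(x, W_hidden, b_hidden, W_out, b_out)
print(y_hat.shape)  # (2,)
```

Training would then compare y_hat to the actual output via the cost function and update the weights by backpropagation, which this sketch leaves out.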

Recurrent Neural Nets

An RNN is a type of ANN that has a recurring connection to itself.

This recurring connection helps the RNN learn the effect of the previous input X_t-1 (a vector) along with the current input X_t (a vector) while predicting the output Yhat_t at time t. This gives the RNN a sense of time. It allows the baby to learn from past scenarios when he got scolded and avoid making the same mistakes.

Let’s visualize a multi-layer vanilla RNN

Recurrent Neural Network with L layers. In each layer, the input (a vector) is unrolled into t different states.
Activation function in each state

Each output (h) from a state (blue) is obtained by applying the activation function to the output of the previous state (h_t-1) and the current input vector (X_t). The outputs h_1 to h_t of the first layer are then fed as inputs into the next layer as we go deeper into the RNN.
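The per-state update just described can be sketched in numpy, assuming a tanh activation and hypothetical weight names W, U and b:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """h_t = tanh(W @ x_t + U @ h_prev + b): new state from current input and previous state."""
    return np.tanh(W @ x_t + U @ h_prev + b)

rng = np.random.default_rng(1)
hidden, features = 4, 3
W = rng.normal(size=(hidden, features))   # weights for the current input
U = rng.normal(size=(hidden, hidden))     # weights for the previous state
b = np.zeros(hidden)

h = np.zeros(hidden)                          # initial state
for x_t in rng.normal(size=(5, features)):    # unroll over t = 1..5
    h = rnn_step(x_t, h, W, U, b)
print(h.shape)  # (4,)
```

The same W, U and b are reused at every state; only h carries information forward through time.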

This will allow us to predict the next word given the context of a sentence.

Based on the context of the sentence, we can find the right word “meh” to use after “can”.

However, when the required context of the sentence gets very large, it might not be so important to remember words such as “so”. Unfortunately, as that gap grows, traditional RNNs become unable to learn to connect the information, because they have no mechanism to ignore or forget unnecessary information.

Long Short-term Memory (LSTM)

The LSTM model is able to represent the actions of taking in information (input gate), giving out a predicted value (output gate) and leaving out unimportant information (forget gate). The LSTM is very popular in sequence modelling tasks such as text generation. Similar to the previous RNN diagram, an LSTM network will have LSTM cells in place of the nodes.

Let’s visualize a single layer made up of LSTM cell states

t LSTM cell states in a single layer with 1 Cell State Belt and 3 Gates in each cell state.

The LSTM structure and the vanilla RNN structure are very similar on the outside, but the main difference is what lies within a single cell state. This is what lets us model the different states in time where we are able to input, output and forget information.

The Cell State Belt allows information to flow from one state to the other.

Every gate has a sigmoid function that returns an output between 0 and 1, representing the proportion of information that passes through the gate. Each gate has weight matrices W and U as well as a bias term b.

𝞂(W·X_t + U·h_t-1 + b) ∈ (0, 1)

The Forget Gate (in purple) allows the current state to only retain a proportion of information from the previous state.
The Input Gate (in green) allows us to decide the proportion of information that should be updated in the current state.
Having both the Forget Gate and Input Gate allows us to both retain past information and update current information in this state.
The Output Gate (in red) allows us to decide how much information in the state should finally “release”, giving us the output h_t for that state.
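Putting the three gates and the Cell State Belt together, one cell state update can be sketched as follows; the weight names and sizes are placeholders, not a trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM cell state update with forget, input and output gates."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget gate: keep a fraction of c_prev
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input gate: how much new info to add
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # output gate: how much to "release"
    g = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate new information
    c_t = f * c_prev + i * g   # cell state belt: retained past + updated current info
    h_t = o * np.tanh(c_t)     # released output h_t for this state
    return h_t, c_t

rng = np.random.default_rng(2)
n, d = 4, 3  # hidden size, input size
p = {f"W{k}": rng.normal(size=(n, d)) for k in "fioc"}
p.update({f"U{k}": rng.normal(size=(n, n)) for k in "fioc"})
p.update({f"b{k}": np.zeros(n) for k in "fioc"})

h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(5, d)):
    h, c = lstm_step(x_t, h, c, p)
print(h.shape)  # (4,)
```

Note how c_t is a weighted mix of the previous cell state (via the forget gate) and the candidate information (via the input gate), which is exactly the retain-and-update behaviour described above.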

With some understanding of LSTMs, we can finally explore the corpus and build our model.


I’ll be using Python 3.6 and Keras for this task. First we will parse the data and tokenize each text.

Number of Users: 343
Number of Texts: 55835
Sequence Length: 5

Creating the Vocabulary

After taking only the first 1000 messages, the modal length of a text is 5 words, and we will use that as our sequence length. Basically, we will use the past 4 words of a sentence to predict the next word.

['just', 'now', 'i', 'heard', 'thunder', 'but', 'e', 'sky', 'still', 'looks', 'ok', 'hee', 'if', 'really', 'rain', 'den', 'i', 'no', 'need', 'to', 'run', 'liao', 'i', 'also', 'lazy', 'but', 'no', 'choice', 'have', 'to', 'force', 'myself', 'to', 'run']

The above is an example of a tokenized text message. One drawback of my approach is that I excluded punctuation, but it could also be modeled otherwise.
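A minimal tokenizer along these lines (lowercasing and dropping punctuation) might look like this; the actual preprocessing in the project may differ:

```python
import re

def tokenize(text):
    """Lowercase and keep only word characters, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("If really rain den I no need to run liao!"))
# ['if', 'really', 'rain', 'den', 'i', 'no', 'need', 'to', 'run', 'liao']
```

Keeping apostrophes in the pattern preserves contractions like “don't” as single tokens.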

We can get the vocabulary of all the texts (unique words).
Vocab Size: 1490
Total words 10419
Vocab / Total words ratio: 0.143

Next we must encode the words with numbers so that they can be fed into the neural network.

Representing words with numbers

The tokenizer.word_index returns a dictionary mapping each word to an index, while tokenizer.index_word returns the reverse. Each word is encoded with an index, and this index is the position to fire up in the respective one-hot vector array.

This helps the neural net understand the vocabulary, which is now represented by one-hot vectors. However, this results in a very large and sparse matrix, which takes up 5×6 cells’ worth of space in this small example but grows rapidly as the vocabulary increases.
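To see the sparsity concretely, here is that 5×6 case: 5 encoded words, a 6-word vocabulary, and only 5 non-zero cells out of 30:

```python
import numpy as np

vocab_size = 6
seq = [2, 0, 3, 1, 5]                # 5 words encoded as vocabulary indices
one_hot = np.eye(vocab_size)[seq]    # each row fires a 1 at the word's index
print(one_hot.shape)                 # (5, 6)
print(int(one_hot.sum()))            # 5 ones in 30 cells: mostly zeros
```

With the real vocabulary of 1490 words, each row would have 1489 zeros, which is why a denser representation is attractive.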

We can use a Word Embedding layer to map the representation into a specified number of dimensions. A recommended size in practice (I found online) is vocab_size**0.25 but in the following example, I will use an embedding size of 3.

Size is reduced when word embeddings are used. The similarity between words will also be represented based on the distance in the vector space.

Model Architecture

We can use a simple architecture with a Sequential model —

  1. Embedding Layer
  2. Bidirectional LSTM Layer (to learn the previous and future context of a sentence, won’t go into the details here)
  3. Dropout Layer (prevent overfitting)
  4. Dense layer to map output size back to the vocab_size
  5. Activation using Softmax to find the most likely category (word) in the vocabulary to use
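The five layers above can be sketched in Keras as follows, using the corpus statistics from earlier; the LSTM width (100 units) and dropout rate (0.2) are assumed values, not necessarily what I used:

```python
from tensorflow.keras import Sequential, layers

vocab_size, seq_len = 1490, 4            # 4 preceding words predict the 5th
embed_dim = int(vocab_size ** 0.25)      # the rule-of-thumb embedding size (~6)

model = Sequential([
    layers.Input(shape=(seq_len,)),                        # 4 encoded word indices
    layers.Embedding(vocab_size, embed_dim),               # 1. dense word vectors
    layers.Bidirectional(layers.LSTM(100)),                # 2. past and future context
    layers.Dropout(0.2),                                   # 3. prevent overfitting
    layers.Dense(vocab_size, activation="softmax"),        # 4 + 5. back to vocab, pick word
])
model.summary()
```

The softmax output is a probability over all 1490 words, so the prediction is simply the highest-probability index.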

Prepare training data

With our model set up and texts encoded, the next step is to prepare the sequence data to be trained.

This many-words-to-one-word context helps us train the model by telling it which sequence of words (our predictors X) leads to the final word (our label Y).

Using the green words as X and red word as Y
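The sliding-window construction of X and Y can be sketched as follows, using made-up word indices:

```python
import numpy as np

encoded = [12, 7, 3, 44, 9, 1, 3, 15]   # one encoded text (word indices)
seq_len = 5                              # 4 predictor words + 1 label

windows = [encoded[i:i + seq_len] for i in range(len(encoded) - seq_len + 1)]
X = np.array([w[:-1] for w in windows])  # first 4 words: predictors
y = np.array([w[-1] for w in windows])   # 5th word: label
print(X.shape, y.shape)  # (4, 4) (4,)
```

Each text of n words therefore contributes n − 4 training examples; texts shorter than the sequence length would be padded or skipped.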

Finally we will compile and fit our model using —

  1. Adam Optimizer (popular for Deep Learning and easy to configure)
  2. Sparse Categorical Cross Entropy (Cost/Loss function for multi-class classification where target outputs are integer indices instead of one-hot encoded)
  3. ModelCheckpoint to save optimal weights each time accuracy improves
  4. EarlyStopping to stop training when validation accuracy does not improve for 4 consecutive epochs.
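A sketch of this compile-and-fit step, run here on tiny random data so it executes quickly; the real run used the corpus sequences and saved to best_weights.hdf5 (newer Keras versions prefer the .keras extension used below):

```python
import numpy as np
from tensorflow.keras import Sequential, layers
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

vocab_size, seq_len = 50, 4              # tiny placeholder sizes for the sketch
model = Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 8),
    layers.Bidirectional(layers.LSTM(16)),
    layers.Dropout(0.2),
    layers.Dense(vocab_size, activation="softmax"),
])
model.compile(
    optimizer="adam",                         # 1. Adam optimizer
    loss="sparse_categorical_crossentropy",   # 2. labels are integer indices, not one-hot
    metrics=["sparse_categorical_accuracy"],
)

callbacks = [
    ModelCheckpoint("best_weights.keras",                     # 3. save on improvement
                    monitor="sparse_categorical_accuracy", save_best_only=True),
    EarlyStopping(monitor="val_sparse_categorical_accuracy",  # 4. stop after 4 flat epochs
                  patience=4),
]

rng = np.random.default_rng(3)
X = rng.integers(0, vocab_size, size=(64, seq_len))   # fake predictor sequences
y = rng.integers(0, vocab_size, size=64)              # fake next-word labels
history = model.fit(X, y, validation_split=0.2, epochs=2,
                    callbacks=callbacks, verbose=0)
```

On random labels nothing meaningful is learned; the point is only the wiring of the optimizer, loss and callbacks.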

We will also save all our objects in a Pickle file so that we can reload them when generating our texts.
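Saving and reloading those objects with pickle might look like this; the object names here are hypothetical stand-ins for the tokenizer mappings and settings:

```python
import pickle

# Hypothetical bundle of everything text generation needs later
objects = {"word_index": {"i": 1, "will": 2, "be": 3}, "seq_len": 5}

with open("text_gen_objects.pkl", "wb") as f:
    pickle.dump(objects, f)                  # persist alongside the model weights

with open("text_gen_objects.pkl", "rb") as f:
    restored = pickle.load(f)                # reload when generating texts
print(restored["seq_len"])  # 5
```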

The training process will look something like this

Epoch 39/100
loss: 4.2459 - sparse_categorical_accuracy: 0.2050 - val_loss: 6.4890 - val_sparse_categorical_accuracy: 0.0924
Epoch 00039: sparse_categorical_accuracy improved from 0.20413 to 0.20503, saving model to best_weights.hdf5
Epoch 40/100
loss: 4.2390 - sparse_categorical_accuracy: 0.2051 - val_loss: 6.4887 - val_sparse_categorical_accuracy: 0.0935
Epoch 00040: sparse_categorical_accuracy improved from 0.20503 to 0.20513, saving model to best_weights.hdf5

Finally, we can create a generate_text function that takes in a seed sentence, “i will be” for example, pads it to the correct sequence_length and uses it to predict the next word iteratively.

Code for generate_text() in Link.
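Without reproducing the linked code, a sketch of such a function might look like this; the stand-in predict function replaces the trained model, and “lah” as the predicted word is purely illustrative:

```python
import numpy as np

def generate_text(seed_words, predict_fn, word_index, index_word, seq_len=5, n_words=5):
    """Iteratively predict the next word from the last seq_len-1 encoded words."""
    words = list(seed_words)
    for _ in range(n_words):
        encoded = [word_index.get(w, 0) for w in words][-(seq_len - 1):]
        padded = [0] * (seq_len - 1 - len(encoded)) + encoded   # left-pad with 0s
        probs = predict_fn(np.array([padded]))[0]               # model.predict would go here
        words.append(index_word[int(np.argmax(probs))])         # most likely next word
    return " ".join(words)

word_index = {"i": 1, "lah": 2, "will": 3, "be": 4}
index_word = {v: k for k, v in word_index.items()}
fake_predict = lambda x: np.eye(len(word_index) + 1)[[2]]       # always favours index 2
print(generate_text(["i", "will", "be"], fake_predict, word_index, index_word, n_words=2))
# i will be lah lah
```

With the real model, predict_fn would be the trained network's predict method and the pickled word mappings would supply word_index and index_word.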

Text generation at work

Well it sounds like Singlish in text messages indeed! The model has managed to learn the grammar even with incomplete spelling and even though the text is forced to have a length of 5 words, it is rather comprehensible.

Reality Check

In the interest of time and money, I did not fully train the network on the entire corpus. In fact, I only used 1000 texts just to test the entire flow. The validation accuracy was extremely low and the model was certainly over-fitting. I also did not search for an optimal network structure or tune any hyperparameters. I also used a free service, but it only offered 2 hours of free GPU time. I am currently waiting for my AWS student account to be verified so that I can train the model on the entire corpus, but my final exams are coming and I could not wait any longer.

It was nonetheless a very interesting problem and a good learning experience. Looking forward to learning, exploring and sharing more when I am back!