Sequence Models by Andrew Ng — 11 Lessons Learned

Source: Deep Learning on Medium

By Ryan Shrott

I recently completed the fifth and final course in Andrew Ng’s deep learning specialization on Coursera: Sequence Models. Ng does an excellent job describing the various modelling complexities involved in creating your own recurrent neural network. My favourite aspect of the course was the programming exercises. In particular, the final programming exercise has you implement a trigger word detection system. These systems can be used to predict when a person says “Alexa” or predict the timing of financial trigger events.

Sequence models, in supervised learning, can be used to address a variety of applications including financial time series prediction, speech recognition, music generation, sentiment classification, machine translation and video activity recognition. The only constraint is that either the input or the output is a sequence. In other words, you may use sequence models to address any type of supervised learning problem in which the input or the output is a sequence, such as a time series.

In this article, I will discuss 11 key lessons that I learned while taking the course. I also have articles detailing other key lessons I learned from the previous 4 courses in the specialization. For my computer vision article, click here. For all other deep learning courses, click here.

Lesson 1: Why not a standard network?

Traditional feedforward neural networks do not share features across different positions of the network. In other words, these models assume that all inputs (and outputs) are independent of each other. This model would not work in sequence prediction since the previous inputs are inherently important in predicting the next output. For example, if you were predicting the next word in a stream of text, you would want to know at least a couple of words before the target word.

Traditional neural networks require the input and output sequence lengths to be constant across all predictions. As discussed in lesson 2, sequence model networks can directly address this problem.
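To make the weight-sharing point concrete, here is a minimal sketch of a recurrent forward pass, assuming a scalar hidden state and made-up parameter values (the names `rnn_forward`, `w_xh`, `w_hh` and `b_h` are mine, not the course's). The same three parameters are reused at every time step, which is why the same model can process sequences of different lengths.

```python
import math

def rnn_forward(xs, w_xh, w_hh, b_h):
    """Run a scalar-state RNN over an input sequence of any length."""
    h = 0.0  # initial hidden state
    states = []
    for x in xs:
        # The same weights are applied at every position in the sequence.
        h = math.tanh(w_xh * x + w_hh * h + b_h)
        states.append(h)
    return states

# Hypothetical parameter values, chosen only for illustration.
short = rnn_forward([1.0, 0.5], w_xh=0.8, w_hh=0.5, b_h=0.0)
long = rnn_forward([1.0, 0.5, -0.3, 0.2, 0.9], w_xh=0.8, w_hh=0.5, b_h=0.0)
print(len(short), len(long))
```

Each hidden state depends on the current input and on the previous hidden state, so earlier inputs influence later outputs.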

Lesson 2: What are the various RNN architecture types?

As discussed in the introduction, sequence models may address a variety of sequence prediction applications. In this course, the instructor discusses a variety of network types including one-to-one, one-to-many, many-to-one and many-to-many networks.

In music generation, the input may be the empty set and the output may be a song (one-to-many). In high-frequency financial volatility forecasting, the input may be a stream of quotes and trades over the past 3 minutes and the output would be the volatility prediction (many-to-one). Most interestingly, the many-to-many architecture can handle applications where the input and output sequence lengths are not the same using the encoder/decoder setup shown in the bottom right of the diagram above. Many-to-many models are commonly referred to as sequence-to-sequence models in the literature.

Lesson 3: How do language models and sequence generation work?

Language models make predictions by estimating the probability of the next word given the words that precede it. After you’ve trained a language model, the conditional distributions you’ve estimated may be used to sample novel sequences.
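As a rough illustration of "estimate conditional distributions, then sample", here is a sketch that swaps the RNN for a much simpler bigram count model, which conditions on only the single previous word. The toy corpus and the function names are my own assumptions; an RNN language model would condition on the whole prefix instead.

```python
import random
from collections import Counter, defaultdict

corpus = "to be or not to be that is the question".split()

# Count how often each word follows each context word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next word | previous word) from the bigram counts."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

def sample_sequence(start, length, rng):
    """Sample words one at a time from the estimated conditionals."""
    words = [start]
    for _ in range(length - 1):
        probs = next_word_probs(words[-1])
        if not probs:  # no observed continuation for this word
            break
        choices, weights = zip(*probs.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return words

print(next_word_probs("be"))
print(" ".join(sample_sequence("to", 6, random.Random(0))))
```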

In the homework exercises, you train a language model on Shakespeare text and generate novel Shakespearean sentences. Although the course only discusses language-based sequence generation, there are various other applications in other fields. In finance, for example, you may use this type of model to generate sample stock paths. You could train the network on various 3-minute tick-by-tick intervals for a single name and then use the network to generate sample paths.

Lesson 4: Vanishing gradients with RNNs

RNNs may have gradients that vanish exponentially fast, making it difficult for the network to learn long-term dependencies. Exploding gradients are less of a problem since you can apply a simple gradient clipping algorithm. Vanishing gradients can also be difficult to spot, which makes them more dangerous when deploying your system into production.
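Here is a small sketch of both effects. It assumes clipping by the overall L2 norm, which is one common variant (clipping each element to a fixed range is another), and the per-step factor of 0.5 is an arbitrary number picked for illustration.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# An exploding gradient (norm 50) is scaled down to norm 5,
# keeping its direction.
clipped = clip_by_norm([30.0, -40.0], max_norm=5.0)
print(clipped)

# Vanishing: backpropagating through 50 time steps multiplies roughly
# 50 per-step factors, so a factor below 1 shrinks the signal to
# almost nothing. Clipping does not help here.
per_step_factor = 0.5  # hypothetical value
vanished = per_step_factor ** 50
print(vanished)
```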

Lesson 5: Capturing long term dependencies

Gated Recurrent Units (GRUs) can be used to address the vanishing gradient problem by adding two gates (update and reset gates) which keep track of the most relevant information for the prediction. The update gate is used to determine how much past information needs to be passed to the next time step. The reset gate is used to determine how much information is irrelevant and should be forgotten. LSTM cells may also be used to address long-term dependencies.
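A minimal sketch of a single GRU step follows, assuming a scalar memory cell and hypothetical weights. It uses the convention I recall from the course, where an update gate near 0 keeps the old memory and a gate near 1 overwrites it; other references flip the gate's meaning.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(c_prev, x, p):
    """One GRU step with a scalar memory cell c and scalar input x."""
    update = sigmoid(p["w_uc"] * c_prev + p["w_ux"] * x + p["b_u"])
    reset = sigmoid(p["w_rc"] * c_prev + p["w_rx"] * x + p["b_r"])
    # The reset gate decides how much of the old memory feeds the candidate.
    candidate = math.tanh(p["w_cc"] * (reset * c_prev) + p["w_cx"] * x + p["b_c"])
    # The update gate blends the candidate with the old memory.
    return update * candidate + (1.0 - update) * c_prev

# Hypothetical weights; the very negative b_u keeps the update gate shut.
params = {"w_uc": 1.0, "w_ux": 1.0, "b_u": -20.0,
          "w_rc": 1.0, "w_rx": 1.0, "b_r": 0.0,
          "w_cc": 1.0, "w_cx": 1.0, "b_c": 0.0}

c = 0.7
for x in [0.3, -0.9, 0.5, 0.1]:
    c = gru_step(c, x, params)
print(c)  # stays very close to the initial 0.7
```

With the update gate nearly closed, the memory value is carried almost unchanged across time steps, which is how the cell preserves information (and gradient signal) over long spans.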

Lesson 6: GRU vs LSTM cells

LSTM cells have an additional gate and are therefore more complex and take longer to train. In theory, LSTM cells should be able to remember longer sequences at the added cost of increased training time; however, there is no clear empirical evidence that either network outperforms the other in all cases. The instructor recommends starting with GRUs since they are simpler and more scalable than LSTM cells.
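One rough way to see the extra cost is to count parameters. The sketch below assumes the textbook formulations, where a GRU learns three transforms of the concatenated previous state and input (update gate, reset gate, candidate) and an LSTM learns four (forget, update, output, candidate). Framework implementations may differ slightly, for example in how many bias vectors they keep. The layer sizes are arbitrary.

```python
def recurrent_params(n_x, n_h, n_blocks):
    """Weights plus biases for n_blocks transforms of [h_prev, x] -> n_h."""
    return n_blocks * (n_h * (n_h + n_x) + n_h)

# Hypothetical sizes: 50-dimensional inputs, 128 hidden units.
gru = recurrent_params(n_x=50, n_h=128, n_blocks=3)
lstm = recurrent_params(n_x=50, n_h=128, n_blocks=4)
print(gru, lstm)  # 68736 91648
```

Under these assumptions the LSTM layer has one third more parameters than the GRU layer of the same size.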

Lesson 7: Transfer Learning using Word Embeddings

Word embeddings can be thought of as the vector representation of a given word. They can be trained using the Word2Vec, Negative Sampling or GloVe algorithms. Word embedding models may be trained on a very large text corpus (say 100B words) and can then be used on a sequence prediction task with a smaller number of training examples (say 10,000 words). For example, sentiment classification may use word embeddings to greatly reduce the number of training examples required to generate an accurate model. In the diagram below, E represents the word embeddings matrix.
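To show what the embedding matrix does, here is a tiny sketch with made-up 3-dimensional vectors (real embeddings are learned and have far more dimensions). For convenience I store one row per word. Multiplying a one-hot vector by the matrix simply selects that word's row, and similar words end up with a high cosine similarity.

```python
import math

vocab = ["good", "great", "terrible"]
# Hypothetical embeddings, one row per vocabulary word.
E = [[0.9, 0.1, 0.3],
     [0.8, 0.2, 0.4],
     [-0.7, 0.1, 0.2]]

def embed(word):
    """Direct lookup of a word's embedding."""
    return E[vocab.index(word)]

def one_hot_lookup(word):
    """Same result, written as a one-hot vector times the matrix."""
    one_hot = [1.0 if w == word else 0.0 for w in vocab]
    dim = len(E[0])
    return [sum(o * row[d] for o, row in zip(one_hot, E)) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(embed("good"), embed("great")))
print(cosine(embed("good"), embed("terrible")))
```

A sentiment classifier trained on only a few examples containing "good" can then generalize to "great", because the two inputs look alike to the model.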

Lesson 8: Designing an algorithm which is free of undesirable biases

Prediction tasks are being used to make increasingly important decisions. So how can we design an algorithm which is free of gender, ethnicity and socioeconomic biases? Some researchers agree that this issue is easier to address in computer programs than in humans. Using a method similar to principal component analysis, we can identify the bias and non-bias subspaces. Then for each word that is non-definitional, we can project the word to have zero bias. Finally, we can equalize pairs of words to neutralize remaining bias. For example, we would want the distance between grandmother and babysitter and the distance between grandfather and babysitter to be equal.
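The projection and equalization steps can be sketched as follows. This is a simplified version under several assumptions: the vectors are made up, the bias direction is taken as the difference of a single word pair rather than found with a PCA-like method, and the equalize step skips the rescaling used in the published method. It only makes the pair symmetric about the non-bias subspace.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def project(u, g):
    """Component of u along the bias direction g."""
    scale = dot(u, g) / dot(g, g)
    return [scale * x for x in g]

def neutralize(e, g):
    """Remove the bias component from a non-definitional word."""
    return sub(e, project(e, g))

def equalize_simple(e1, e2, g):
    """Give a definitional pair a shared non-bias part and opposite bias parts."""
    mu = [(a + b) / 2 for a, b in zip(e1, e2)]
    mu_bias = project(mu, g)
    mu_orth = sub(mu, mu_bias)
    return (add(mu_orth, sub(project(e1, g), mu_bias)),
            add(mu_orth, sub(project(e2, g), mu_bias)))

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Made-up embeddings for illustration only.
grandmother = [0.9, 0.4, 0.2]
grandfather = [-0.7, 0.5, 0.3]
babysitter = [0.4, 0.1, 0.8]
g = sub(grandmother, grandfather)  # assumed bias direction

babysitter_n = neutralize(babysitter, g)
gm, gf = equalize_simple(grandmother, grandfather, g)
print(distance(babysitter_n, gm), distance(babysitter_n, gf))
```

After both steps, the neutralized word has no component along the bias direction and sits at the same distance from each member of the pair.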

Lesson 9: Machine translation using search algorithms

Search algorithms can be used to generate the most likely French sentence given an English sentence. Beam search is an algorithm that is commonly used for this task. The greediness of this algorithm is defined by the beam width parameter. If you want to be less greedy, you would set the beam width to a larger positive integer. When diagnosing errors, it’s possible to determine whether an error is due to the beam search algorithm or to your trained RNN model: if the model assigns a higher probability to the correct translation than to the one beam search returned, the search is at fault; otherwise the model is.
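Here is a compact sketch of beam search over a toy next-token table. The table, the token names and the `next_probs` callback are hypothetical stand-ins for a trained translation model, and the sketch leaves out length normalization. The table is built so that the greedy first choice leads to a worse overall sequence.

```python
import math

def beam_search(next_probs, start, beam_width, max_len, end="</s>"):
    """Keep the beam_width most probable partial sequences at each step."""
    beams = [([start], 0.0)]  # (tokens, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end:  # finished sequences are carried over
                candidates.append((tokens, score))
                continue
            for tok, p in next_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(tokens[-1] == end for tokens, _ in beams):
            break
    return beams[0]

# Toy model: the distribution depends only on the last token.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0}, "y": {"</s>": 1.0},
    "z": {"</s>": 1.0}, "w": {"</s>": 1.0},
}

def toy_model(tokens):
    return TABLE[tokens[-1]]

greedy_tokens, greedy_score = beam_search(toy_model, "<s>", beam_width=1, max_len=5)
beam_tokens, beam_score = beam_search(toy_model, "<s>", beam_width=2, max_len=5)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

With a beam width of 1 the search commits to "a" and ends with probability about 0.30, while a width of 2 keeps "b" alive and finds the sequence with probability about 0.36.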

Lesson 10: Attention models in sequence-to-sequence models

Loosely speaking, attention models are based on the visual attention mechanism found in humans. The algorithm attempts to learn what to pay attention to based on the input sequence seen so far. Attention models are extremely useful in tasks such as neural machine translation. There is a homework assignment that gets you to implement this model yourself.
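The core computation can be sketched in a few lines. This version assumes a plain dot product as the scoring function, whereas the model in the course learns the scores with a small neural network, and the encoder and decoder states are made-up numbers. The softmax turns scores into attention weights that sum to 1, and the context is the weighted sum of the encoder states.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights and the resulting context vector."""
    scores = [sum(a * b for a, b in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Hypothetical states: one encoder state per input word.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

The decoder state here is most similar to the first encoder state, so that input position receives the largest weight.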

Lesson 11: Speech Recognition with sequence-to-sequence models

Sequence-to-sequence models allow a practitioner to take a simpler, end-to-end approach to speech recognition applications. Once upon a time, speech recognition systems were built with phonemes. With the rise of big data, however, we can use end-to-end deep learning and completely remove the manual phoneme and feature engineering steps.

Trigger word detection systems are the final application of speech recognition described in the course. You will implement such an algorithm in the homework exercises on your own. I personally trained a network to turn my lamp on and off. Trigger detection algorithms also have various applications in other fields such as financial economics; perhaps you could train an algorithm to detect events/spikes in a stock time series.
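One piece of such a system is building the training labels. The sketch below assumes the labeling scheme I recall from the assignment: the target is 0 everywhere except for a short run of 1s right after each trigger word ends. The function name, the number of steps and the run length are placeholders.

```python
def label_trigger(num_steps, trigger_end_steps, ones_after=50):
    """Build 0/1 targets with a run of 1s after each trigger word ends."""
    labels = [0] * num_steps
    for end in trigger_end_steps:
        # Mark the steps right after the trigger, without running off the end.
        for t in range(end + 1, min(end + 1 + ones_after, num_steps)):
            labels[t] = 1
    return labels

# Tiny example: 20 output steps, trigger words ending at steps 4 and 17.
labels = label_trigger(num_steps=20, trigger_end_steps=[4, 17], ones_after=3)
print(labels)
```

Labeling a run of steps instead of a single step gives the network more positive examples to learn from, since trigger events are rare compared with background audio. The same scheme would apply to marking spikes in a stock time series.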


Overall, the vast number of applications that sequence models have makes this course well worth your time. The homework assignments also give you practice implementing practical systems on your own. The lessons I explained above represent only a subset of the materials presented in the course. By taking this course, you will just scratch the surface of sequence models, but it may be enough to kickstart an opportunity or career in artificial intelligence. I was not endorsed by for writing this article.

If you have any interesting applications of sequence models you would like to share, let me know in the comments below. I would be happy to discuss potential collaboration on new projects.

That’s all folks — if you’ve made it this far, please comment below and add me on LinkedIn.

My Github is here.

Other Deep Learning Course Blogs

Computer Vision by Andrew Ng — 11 Lessons Learned

Deep Learning Specialization by Andrew Ng — 21 Lessons Learned

Other Interesting Articles

Sign Language Recognition with HMM’s

Probabilistic Approaches to Combinatorial Optimization