Original article can be found here (source): Deep Learning on Medium
Recurrent Neural Networks in Deep Learning — Part 1
By Priyal Walpita
Reading this article will help you understand Artificial Neural Networks (ANN), the drawbacks of ANNs, the architecture of Recurrent Neural Networks (RNN), the advantages of RNNs over ANNs, how RNNs work, and how to build a sequence model and apply it to various use cases. I have intentionally kept this article focused on theory and its interpretation, concentrating primarily on Recurrent Neural Networks (RNN).
This blog post consists of two parts, and this is the first. This part gives an introduction to RNNs; the second part will discuss the types of RNN and a few practical uses.
What Is an Artificial Neural Network (ANN)?
An ANN is organized in layers, and every layer contains a group of perceptrons (neurons). It is often called a Feed-Forward Neural Network, since inputs are processed only in the forward direction. An ANN (or deep neural network) is made up of three kinds of layers: Input, Hidden (one or many) and Output. The input layer accepts the inputs, the hidden layer(s) process them, and the output layer generates the result. Essentially, every layer learns its own set of weights. But ANNs come with drawbacks.
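To make the input, hidden and output layers concrete, here is a minimal sketch of one forward pass in NumPy. The layer sizes, the tanh activation and the names (feed_forward, W_hidden, W_out) are assumptions chosen for illustration, not something the article prescribes.

```python
import numpy as np

def feed_forward(x, W_hidden, b_hidden, W_out, b_out):
    """One forward pass: input layer -> hidden layer -> output layer."""
    hidden = np.tanh(W_hidden @ x + b_hidden)   # hidden layer processes the inputs
    return W_out @ hidden + b_out               # output layer generates the result

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 5, 2                 # illustrative sizes (assumption)

W_hidden = rng.normal(size=(n_hidden, n_in))
b_hidden = np.zeros(n_hidden)
W_out = rng.normal(size=(n_out, n_hidden))
b_out = np.zeros(n_out)

x = rng.normal(size=n_in)                       # a single input example
y = feed_forward(x, W_hidden, b_hidden, W_out, b_out)
print(y.shape)  # (2,)
```

Notice that the function keeps no state between calls: feeding the same x twice gives the same y, regardless of what was fed before.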
Drawbacks In Artificial Neural Network (ANN)
- An ANN cannot capture the sequential information in the input data, which is required for dealing with sequence data. That is, when data elements are related to each other (e.g. speech recognition, text generation, semantic recognition of text or voice), we cannot treat each element separately.
So, to overcome this drawback, we use a Recurrent Neural Network (RNN).
Let’s first look at the architectural difference between an RNN and an ANN:
“A looping constraint on the hidden layer of an ANN turns it into an RNN.”
An RNN has a recurrent connection on the hidden state, as the architecture diagram shows. This looping constraint ensures that the sequential information in the input data is captured.
What Is a Recurrent Neural Network (RNN)?
In this type of neural network, the output of the previous step is fed as input to the current step. In traditional neural networks, all inputs and outputs are independent of each other. But in cases such as predicting the next word of a sentence, the preceding words are needed, so the network has to remember them. RNNs solve this problem with the help of a hidden layer. The key and most significant feature of an RNN is its hidden state, which remembers some information about the sequence.
An RNN has a “memory” that retains information about what has been computed so far. It uses the same parameters at every step, because it performs the same function on every input and hidden state to generate the output. Unlike in other neural networks, this reduces the number of parameters.
Advantages Of Recurrent Neural Network (RNN)
• An RNN captures the sequential information present in the input data, i.e. the dependency between the words in the text when making predictions:
As you can see here, the output (o1, o2, o3, o4) at each time step depends not only on the current word but also on the previous words.
How Does a Recurrent Neural Network (RNN) Work?
Let’s take an example to understand this approach:
Suppose there is a deeper network with one input layer, three hidden layers and one output layer. As in traditional neural networks, each hidden layer has its own set of weights and biases: say w1 and b1 for the first hidden layer, and similarly w2, b2 for the second and w3, b3 for the third. These layers are independent of each other, meaning they do not memorize previous outputs.
The recurrent neural network works as follows:
1) First, the RNN converts the independent activations into dependent activations by assigning the same weights and biases to all layers. This reduces the number of parameters and lets the network memorize each previous output by feeding it as input to the next hidden layer.
2) These three layers, sharing the same weights and biases, can then be merged into a single recurrent layer.
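The effect of sharing weights can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: every layer has the same hypothetical width n, tanh is the activation, and no new input arrives at each step. It compares the parameter counts and shows that three layers with identical weights behave exactly like one layer applied three times in a loop.

```python
import numpy as np

n = 8  # hypothetical width shared by all layers (assumption)

# Three independent hidden layers: each has its own (w, b).
separate_params = sum(n * n + n for _ in range(3))
# One recurrent layer: a single (w, b) reused at every step.
shared_params = n * n + n
print(separate_params, shared_params)  # 216 72

def stacked_layers(x, layers):
    """Apply a list of (w, b) layers one after another."""
    h = x
    for w, b in layers:
        h = np.tanh(w @ h + b)
    return h

def recurrent_layer(x, w, b, steps):
    """Apply the same (w, b) repeatedly, feeding each output back in."""
    h = x
    for _ in range(steps):
        h = np.tanh(w @ h + b)
    return h

rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=(n, n))
b = rng.normal(scale=0.5, size=n)
x = rng.normal(size=n)

same_weights_stack = stacked_layers(x, [(w, b)] * 3)
looped = recurrent_layer(x, w, b, steps=3)
print(np.allclose(same_weights_stack, looped))  # True
```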
Formula for calculating the current state
Formula for applying Activation function (tanh) :
Formula for calculating output :
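The formulas themselves are not reproduced in this text. A commonly used formulation that matches the three captions above is sketched here; the symbol names are assumptions: h_t is the current state, h_{t-1} the previous state, x_t the current input, y_t the output, and W_hh, W_xh, W_hy the weights of the recurrent, input and output connections.

```latex
\begin{aligned}
h_t &= f(h_{t-1},\, x_t) && \text{current state} \\
h_t &= \tanh\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right) && \text{with the tanh activation} \\
y_t &= W_{hy}\, h_t && \text{output}
\end{aligned}
```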
How Does Training an RNN Work?
- The network is given one time step of the input at a time.
- The current state is computed from the current input and the previous state.
- The current state ht then becomes ht-1 for the next time step.
- This can be repeated for as many time steps as the problem requires, combining the information from all the previous states.
- Once all the time steps are completed, the final state is used to calculate the output.
- The output is then compared with the target output, and the difference between them gives the error.
- The error is then propagated back through the network to update the weights and achieve a better result.
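The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a full training loop: the sizes, the squared-error measure and the weight names (W_xh, W_hh, W_hy) are assumptions, and for brevity only the output weights are updated. Full back propagation through time would also update W_xh and W_hh.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, W_hy):
    """Process the sequence one time step at a time, then compute the output."""
    h = h0
    for x in xs:                             # one time step of input at a time
        h = np.tanh(W_xh @ x + W_hh @ h)     # current state from input and previous state
        # h now acts as the previous state for the next time step
    return W_hy @ h, h                       # output computed from the final state

rng = np.random.default_rng(1)
n_in, n_hidden, n_out, T = 3, 4, 2, 5        # illustrative sizes (assumption)
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))

xs = rng.normal(size=(T, n_in))              # a sequence of T inputs
target = np.array([1.0, 0.0])                # made-up target output

output, h_last = rnn_forward(xs, np.zeros(n_hidden), W_xh, W_hh, W_hy)
error = 0.5 * np.sum((output - target) ** 2)

# Propagate the error back to the output weights and take one small step.
grad_W_hy = np.outer(output - target, h_last)
W_hy_updated = W_hy - 0.1 * grad_W_hy

new_output, _ = rnn_forward(xs, np.zeros(n_hidden), W_xh, W_hh, W_hy_updated)
new_error = 0.5 * np.sum((new_output - target) ** 2)
print(error, new_error)                      # the error after the update is smaller
```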
Deep Learning Models : Recurrent Neural Network
Sequence models are among the most interesting areas of deep learning. The principal example of a sequence model is the Recurrent Neural Network.
What Is a Sequence Model In Deep Learning?
Sequence models are a general class of deep learning models that deal with sequences such as voice, text or any other sequence of inputs. The models we use to solve these problems include Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) models and variants such as GRUs (Gated Recurrent Units).
Before we take a closer look at these sequence models, let’s look at some of the areas where they can be applied. As we study the models in detail, it is useful to have a clear picture of the use cases.
Examples Of Sequence Data
- Speech Recognition : We get an audio clip “x” as input and have to build a model that maps it to a text transcript “y”. Here, both the input and the output are sequences.
- Music Generation : The input may be an empty set, a single integer, a genre, or even the first few notes of the music you want. So “x” can be nothing or just an integer, while the output is a piece of music, which is a sequence.
- Sentiment Classification : The input is a sequence of text, such as a review, and the model has to output a rating (an integer between 1 and 5).
- DNA Sequence Analysis : An input DNA sequence is given and you need to find out which portion or subsequence of that DNA corresponds to a protein.
- Machine Translation : The input is a sequence in one language, say French, and the output is a sequence in another language, say English.
- Video Activity Recognition : The input is a sequence of video frames, and the output is a label for the activity shown in the video, such as “running”, “swimming” or “walking”.
- Named Entity Recognition : The input is a sequence (a sentence), and in that sequence we should recognize all entities, such as locations, people and organizations. The output is a sequence of integers (or labels).
Finally, as you can see, sequence models solve a wide variety of problems, from cases where both the input and the output are sequences to cases where only one of them is. All of the above problems can also be categorized as supervised learning, since they all come with labels that we use to build our models.
With all this in mind, let’s define a framework that will allow us to solve these problems.
Building A Sequence Model Framework For The Above Problems
Let’s take the named entity recognition problem mentioned above. There we have an input “x” that goes like this:
“ Harry Potter and Hermione Granger invented a new spell ”.
The sentence above mentions ‘Harry Potter’ and ‘Hermione Granger’, which are entities of type ‘person’. Search engines typically perform this kind of named entity recognition to index, say, the news from the last 24 hours, so that a search for those people brings up what is going on with them. The input is a sequence of nine words, so we will have nine features to represent these words.
We use the notation x(t) to represent the word at position t in the sentence, as shown in the table below:
The length of the sequence x is denoted Tx.
After the RNN processes the input x(t), the output y(t) is generated; the length of y is denoted Ty. Note that Ty and Tx do not have to be equal. For example, if our model decides whether each word is part of a name (such as the name of a person or a location), then y will be a vector whose elements are either 1 or 0, where 1 means that the word is part of a name. From our own knowledge, we know which words are names in this sentence:
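As a small illustration of this notation, the snippet below splits the example sentence into the words x(1) to x(9) and pairs each with a label y(t). The labels are an assumption read off the sentence by hand (1 for the words of ‘Harry Potter’ and ‘Hermione Granger’, 0 otherwise), not the output of a model.

```python
sentence = "Harry Potter and Hermione Granger invented a new spell"

x = sentence.split()               # x[t - 1] holds the word x(t)
T_x = len(x)                       # length of the input sequence

# Hand-assigned labels (assumption): 1 if the word is part of a name, else 0.
y = [1, 1, 0, 1, 1, 0, 0, 0, 0]
T_y = len(y)                       # length of the output sequence

for t, (word, label) in enumerate(zip(x, y), start=1):
    print(f"x({t}) = {word:<9} y({t}) = {label}")

names = [word for word, label in zip(x, y) if label == 1]
print(names)  # ['Harry', 'Potter', 'Hermione', 'Granger']
```

In this particular task Tx and Ty happen to be equal (both are 9), because there is one label per word.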
The case above is only a single sample. Normally many samples are processed in deep learning, so another superscript is added to denote the index of the sample:
• Superscript (i) denotes an object associated with the i-th example.
Example: x(i) is the input of the i-th training example.
• Superscript (t) denotes an object at the t-th time step.
Example: x(t) is the input x at the t-th time step. x(i)(t) is the input at the t-th time step of example i.
Representing words is a very important task in NLP (Natural Language Processing). A one-hot vector is commonly used to represent a single word, such as “Harry” in the example above.
The first step is to build a word dictionary; assume its size is 10,000. The vector describing a word is then a one-hot vector with 10,000 elements: only the element at the word’s index in the dictionary is set to 1 and all other elements are 0. For example, the vector for “Harry” is as follows:
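Here is a minimal sketch of one-hot encoding in NumPy. To keep the output readable it uses a toy dictionary of nine lowercase words instead of 10,000; the dictionary contents, the word order and the helper name one_hot are all assumptions made for illustration.

```python
import numpy as np

# Toy dictionary (assumption); the article assumes 10,000 entries.
vocabulary = ["a", "and", "granger", "harry", "hermione",
              "invented", "new", "potter", "spell"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's dictionary index."""
    vec = np.zeros(len(vocabulary))
    vec[word_to_index[word.lower()]] = 1.0
    return vec

harry = one_hot("Harry")
print(word_to_index["harry"])  # 3
print(harry)
```

With a real 10,000-word dictionary the vector would have 10,000 elements, still with exactly one element set to 1.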
Why Not A Standard Network ?
The first thought that comes to mind is: what if all of these one-hot encoded vectors went as input into a standard neural network, with some hidden layers and an output layer that predicts the output vector ‘y’?
The issues with this approach are:
- Inputs and outputs may have different lengths in different examples; the length of a sentence is not limited to any fixed number.
- Another concern is that what the network learns about a word at one position in the text is not shared with the other positions.
In short, a traditional neural network will not train well or capture the patterns, and the number of parameters will explode, because the input weight matrix grows with both the vocabulary size and the length of the sentence.
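To get a feel for how quickly the parameters grow, here is a rough back-of-the-envelope calculation. The dictionary size of 10,000 comes from the article; the maximum sentence length and the number of hidden units are hypothetical values picked only for this illustration.

```python
vocab_size = 10_000     # dictionary size used in the article
max_len = 50            # hypothetical maximum sentence length
hidden_units = 100      # hypothetical width of the first hidden layer

# Standard network: all one-hot vectors are concatenated into one long input.
standard_input_weights = (max_len * vocab_size) * hidden_units

# RNN: the same input weights are reused for the word at every time step.
rnn_input_weights = vocab_size * hidden_units

print(standard_input_weights)  # 50000000
print(rnn_input_weights)       # 1000000
```

Under these assumptions the standard network needs 50 times more input weights, and the gap widens as the allowed sentence length grows.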
This is where Recurrent Neural Networks (RNNs) come in, as they have no such limitations.
Ex: Suppose you read the sentence from left to right, and you also have the labeled output vector “y” that you are trying to learn (a mapping between “x” and “y”). You feed the first word into a neural network (hidden layer) and try to predict whether or not it is a “name”. When the second word goes in at time step 2, the network also takes the activation from time step 1 and tries to predict whether the second word is a “name” or not, and this continues until the last time step, where the sentence ends.
It is worth noting that the RNN reads the sentence from left to right, so a word that appears at time “t” is not taken into account for the prediction at “t-1”, even though it may be significant.
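The left-to-right reading can be sketched in NumPy as below. This is an untrained toy model under stated assumptions: random weights, one-hot inputs over a nine-word toy vocabulary, a tanh hidden activation, a sigmoid output, and made-up names (rnn_tag, W_ax, W_aa, w_ya). Its predictions are meaningless as labels, but it shows how each step reuses the previous activation and why a later word cannot influence an earlier prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_tag(xs, W_ax, W_aa, b_a, w_ya, b_y):
    """Return, for each word, the predicted probability that it is part of a name."""
    a = np.zeros(W_aa.shape[0])                   # activation before the first word
    probs = []
    for x in xs:                                  # read the sentence left to right
        a = np.tanh(W_ax @ x + W_aa @ a + b_a)    # uses the previous step's activation
        probs.append(sigmoid(w_ya @ a + b_y))     # prediction for the current word
    return np.array(probs)

rng = np.random.default_rng(0)
n_a, vocab = 4, 9                                 # illustrative sizes (assumption)
W_ax = rng.normal(size=(n_a, vocab))
W_aa = rng.normal(size=(n_a, n_a))
b_a = np.zeros(n_a)
w_ya = rng.normal(size=n_a)
b_y = 0.0

xs = np.eye(vocab)                                # nine distinct one-hot "words"
probs = rnn_tag(xs, W_ax, W_aa, b_a, w_ya, b_y)

# Swap the last word for another one: the earlier predictions do not change.
xs_changed = xs.copy()
xs_changed[-1] = xs[0]
probs_changed = rnn_tag(xs_changed, W_ax, W_aa, b_a, w_ya, b_y)
print(np.allclose(probs[:-1], probs_changed[:-1]))  # True
```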
Based on the RNN architecture described above, let’s try to write down a functional model that we can work with for now.
Forward Propagation In RNNs
These are the building blocks of an RNN. Next, let’s see how we can train the model with them, which means how we can do back propagation. (In most deep learning frameworks, once you provide the forward propagation structure, the framework takes care of the backprop.)
In backprop you start at the right and go all the way to the left. You calculate the loss for each individual word, sum all the losses, and then calculate the derivatives with respect to the parameters at each step from right to left. Gradient descent uses these derivatives to arrive at the optimal (w, b). This is how you train your RNN.
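The procedure described above can be sketched by hand in NumPy. This is one possible implementation under stated assumptions, not the article’s own: a tanh hidden layer, a sigmoid output per word, a cross-entropy loss summed over all words, plain gradient descent with a fixed learning rate, and toy one-hot inputs with hand-assigned labels. All names (forward, backward, W_ax, W_aa, w_ya) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, xs, ys):
    """Left to right: compute activations, predictions and the summed loss."""
    W_ax, W_aa, b_a, w_ya, b_y = params
    a = np.zeros(W_aa.shape[0])
    activations, probs, loss = [a], [], 0.0
    for x, y in zip(xs, ys):
        a = np.tanh(W_ax @ x + W_aa @ a + b_a)
        p = sigmoid(w_ya @ a + b_y)
        loss += -(y * np.log(p) + (1 - y) * np.log(1 - p))  # loss for this word
        activations.append(a)
        probs.append(p)
    return loss, activations, probs

def backward(params, xs, ys, activations, probs):
    """Right to left: accumulate the derivatives of the summed loss."""
    W_ax, W_aa, b_a, w_ya, b_y = params
    grads = [np.zeros_like(W_ax), np.zeros_like(W_aa),
             np.zeros_like(b_a), np.zeros_like(w_ya), 0.0]
    da_next = np.zeros_like(b_a)                      # gradient arriving from step t + 1
    for t in reversed(range(len(xs))):
        dz = probs[t] - ys[t]                         # derivative at the sigmoid input
        grads[3] += dz * activations[t + 1]
        grads[4] += dz
        da = dz * w_ya + da_next
        dpre = da * (1.0 - activations[t + 1] ** 2)   # back through tanh
        grads[0] += np.outer(dpre, xs[t])
        grads[1] += np.outer(dpre, activations[t])
        grads[2] += dpre
        da_next = W_aa.T @ dpre                       # pass the gradient to step t - 1
    return grads

rng = np.random.default_rng(0)
n_a, vocab = 6, 9                                     # illustrative sizes (assumption)
params = [rng.normal(scale=0.1, size=(n_a, vocab)),
          rng.normal(scale=0.1, size=(n_a, n_a)),
          np.zeros(n_a), rng.normal(scale=0.1, size=n_a), 0.0]
xs = np.eye(vocab)                                    # nine one-hot "words"
ys = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0], dtype=float)

losses = []
for step in range(200):
    loss, activations, probs = forward(params, xs, ys)
    grads = backward(params, xs, ys, activations, probs)
    params = [p - 0.1 * g for p, g in zip(params, grads)]  # gradient descent step
    losses.append(loss)

print(losses[0], losses[-1])                          # the summed loss goes down
```

In practice a framework would derive the backward pass automatically from the forward pass, so this hand-written version is mainly useful for seeing what happens at each step.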
The following two pictures depict the forward and back propagation of an RNN.