Source: Deep Learning on Medium

Humans don’t start their thinking from scratch every second. When you think about something, your thoughts are usually related to past events, and that is how humans process sequential data.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events to inform later ones.

### So what is a Recurrent Neural Network?

A Recurrent Neural Network is an extension of a conventional feedforward neural network that is able to handle variable-length sequence input. The RNN handles a variable-length sequence by having a recurrent hidden state whose activation at each time step depends on that of the previous time step.

More formally, given a sequence x = (x_1, x_2, …, x_T), the RNN updates its recurrent hidden state h_t by

h_t = ϕ(h_{t−1}, x_t)

where ϕ is a nonlinear function, such as the composition of a logistic sigmoid with an affine transformation. Optionally, the RNN may produce an output y = (y_1, y_2, …, y_T), which may also be of variable length.
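As a concrete sketch of this recurrence in numpy, taking ϕ to be a logistic sigmoid composed with an affine transformation (the vector sizes, weight scale, and bias term here are illustrative assumptions, not values from the text):

```python
import numpy as np

def phi(h_prev, x, W_hh, W_xh, b):
    # one recurrence step: affine transformation followed by a logistic sigmoid
    return 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_xh @ x + b)))

rng = np.random.default_rng(0)
hidden, inp = 3, 2                                 # illustrative sizes
W_hh = rng.standard_normal((hidden, hidden)) * 0.1
W_xh = rng.standard_normal((hidden, inp)) * 0.1
b = np.zeros(hidden)

xs = [rng.standard_normal(inp) for _ in range(5)]  # a length-5 input sequence
h = np.zeros(hidden)                               # h_0 = 0
for x_t in xs:
    h = phi(h, x_t, W_hh, W_xh, b)                 # the same phi at every step
print(h.shape)  # (3,)
```

Note that the same function ϕ (with the same weights) is applied at every time step, which is what lets the network accept sequences of any length.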

In the above diagram, a chunk of neural network, **A**, looks at some input **x_t** and outputs a value **h_t**. A loop allows information to be passed from one step of the network to the next.

**Types of RNN Architectures**

Each rectangle is a vector and arrows represent functions (e.g. matrix multiplication). Input vectors are in red, output vectors are in blue, and green vectors hold the RNN’s state. From left to right:

**One to one**

Vanilla mode of processing without an RNN, from fixed-sized input to fixed-sized output (e.g. image classification).

**One to many**

Sequence output: for example, we feed an image into an RNN and it outputs a sequence of words.

**Many to one**

Sequence input: for example, we input the sentence “There is nothing to like in this movie” and the network outputs a rating, say 1 out of 5 stars. This is called sentiment analysis.

**Many to many**

Sequence input and sequence output: for example, machine translation reads a sentence in English and outputs a sentence in French.

**Many to Many(2)**

Synced sequence input and output (e.g. video classification, where we wish to label each frame of the video).

*Notice that in every case there are no pre-specified constraints on the sequence lengths, because the recurrent transformation (green) is fixed and can be applied as many times as we like.*

### RNN Computation

At the core, RNNs have a deceptively simple API. They accept an input vector **x** and give you an output vector **y**. Crucially, however, this output vector’s contents are influenced not only by the input you just fed in, but also by the entire history of inputs you’ve fed in in the past. Written as a class, the RNN’s API consists of a single **step** function:

```python
rnn = RNN()
y = rnn.step(x)  # x is an input vector, y is the RNN's output vector
```

The RNN class has some internal state that it gets to update every time **step** is called. In the simplest case this state consists of a single hidden vector **h**. Below is an implementation of the step function in a vanilla RNN:

```python
import numpy as np

class RNN:
    # ...
    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y
```

The above specifies the forward pass of a vanilla RNN. This RNN’s parameters are the three matrices **W_hh**, **W_xh**, **W_hy**. The hidden state **self.h** is initialized with the zero vector. The **np.tanh** function implements a non-linearity that squashes the activations to the range **[-1, 1]**.

Notice briefly how this works: there are two terms inside the tanh, one based on the previous hidden state and one based on the current input. In numpy, **np.dot** is matrix multiplication. The two intermediates interact with addition, and then get squashed by the **tanh** into the new state vector.

In math notation, the hidden state update is:

h_t = tanh(W_hh · h_{t−1} + W_xh · x_t)

We initialize the matrices of the RNN with random numbers, and the bulk of the work during training goes into finding the matrices that give rise to desirable behavior, as measured with some loss function that expresses your preference for what kinds of outputs **y** you’d like to see in response to your input sequences **x**.

**W_hh**: matrix applied to the previous hidden state

**W_xh**: matrix applied to the current input

**W_hy**: matrix mapping the hidden state to the output
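Putting the pieces together, here is a minimal sketch of how these three matrices might be randomly initialized and used with the step function above (the layer sizes, the 0.01 initialization scale, and the `__init__` signature are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

class RNN:
    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)
        # small random initialization; the 0.01 scale is a common heuristic
        self.W_xh = rng.standard_normal((hidden_size, input_size)) * 0.01
        self.W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
        self.W_hy = rng.standard_normal((output_size, hidden_size)) * 0.01
        self.h = np.zeros(hidden_size)  # hidden state starts as the zero vector

    def step(self, x):
        # same forward pass as above
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        return np.dot(self.W_hy, self.h)

rnn = RNN(input_size=4, hidden_size=8, output_size=4)
x = np.zeros(4)
x[0] = 1.0            # a one-hot input vector
y = rnn.step(x)
print(y.shape)        # (4,)
```

Calling `rnn.step(x)` repeatedly on successive inputs carries the hidden state forward, which is how the output comes to depend on the whole input history.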

### Character-Level Language Models

We’ll train RNN character-level language models. That is, we’ll give the RNN a huge chunk of text and ask it to model the probability distribution of the next character in the sequence given the sequence of previous characters. This will allow us to generate new text one character at a time.

As a working example, suppose we only had a vocabulary of four possible letters “helo”, and wanted to train an RNN on the training sequence “hello”. This training sequence is in fact a source of 4 separate training examples:

- The probability of “e” should be likely given the context of “h”
- “l” should be likely in the context of “he”
- “l” should also be likely given the context of “hel” , and finally
- “o” should be likely given the context of “hell”

Each character is encoded into a vector using 1-of-k encoding (i.e. all zeros except for a single one at the index of the character in the vocabulary), and the vectors are fed into the RNN one at a time with the **step** function. We then observe a sequence of 4-dimensional output vectors (one dimension per character), which we interpret as the confidence the RNN assigns to each character coming next in the sequence. Here’s a diagram:
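A minimal sketch of this 1-of-k encoding for the “helo” vocabulary (the helper name `one_hot` and the index mapping are illustrative choices, not from the text):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    # 1-of-k: all zeros except a single 1 at the character's vocabulary index
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

inputs = [one_hot(ch) for ch in 'hell']  # feed "hell", with targets "ello"
print(inputs[0])  # [1. 0. 0. 0.]  -> "h"
```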

For example, we see that in the first step, when the RNN saw the character “h”, it assigned a confidence of 1.0 to the next letter being “h”, 2.2 to the letter “e”, -3.0 to “l”, and 4.1 to “o”. Since in our training data (the string “hello”) the next character is “e”, we would like to increase its confidence (green) and decrease the confidence of all other letters (red). Similarly, we have a desired target character at every one of the 4 time steps that we’d like the network to assign a greater confidence to.

Since the RNN consists entirely of differentiable operations, we can run the backpropagation algorithm (a recursive application of the chain rule from calculus) to figure out in what direction we should adjust every one of its weights to increase the scores of the correct targets (green bold numbers). We can then perform a parameter update, which nudges every weight a tiny amount in the gradient direction. If we fed the same inputs to the RNN after the parameter update, we would find that the scores of the correct characters (e.g. “e” in the first step) would be slightly higher (e.g. 2.3 instead of 2.2), and the scores of incorrect characters would be slightly lower. We then repeat this process over and over many times until the network converges and its predictions are eventually consistent with the training data, in that the correct characters are always predicted next.
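The confidence scores above are typically turned into probabilities with a softmax, and the nudge toward the correct character corresponds to the gradient of a cross-entropy loss. A small sketch using the example scores from the first step (the softmax/cross-entropy choice is a standard one, not something the text itself specifies):

```python
import numpy as np

scores = np.array([1.0, 2.2, -3.0, 4.1])  # RNN outputs for "h","e","l","o" after seeing "h"
target = 1                                # index of "e", the correct next character

# softmax turns raw scores into a probability distribution
# (shifting by the max is a standard numerical-stability trick)
probs = np.exp(scores - scores.max())
probs /= probs.sum()

loss = -np.log(probs[target])  # cross-entropy loss for this time step

# gradient of the loss w.r.t. the scores: subtract 1 at the target index.
# The update moves the target score up (negative gradient) and the rest down.
dscores = probs.copy()
dscores[target] -= 1.0
print(dscores[target] < 0, all(dscores[i] > 0 for i in (0, 2, 3)))  # True True
```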

### Problem of RNNs

Let’s say we want our RNN to predict the last word in the text “I grew up in France… I speak fluent *French*.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large. Unfortunately, as the gap grows, RNNs become unable to learn to connect the information.

As the gap grows, the gradients tend to either vanish or explode, which makes gradient-based optimization methods struggle.
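A quick numerical illustration of the vanishing case: backpropagating through many time steps multiplies the gradient by the (transposed) recurrent matrix over and over, so when that matrix is small the signal decays exponentially (the matrix size and 0.1 scale here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W_hh = rng.standard_normal((10, 10)) * 0.1  # a small recurrent matrix

# backpropagating through 50 steps repeatedly multiplies the gradient by W_hh^T
# (the tanh derivative factors, ignored here, would only shrink it further)
grad = np.ones(10)
for t in range(50):
    grad = W_hh.T @ grad
print(np.linalg.norm(grad))  # vanishingly small: the signal from 50 steps back is gone
```

With a large recurrent matrix the same loop instead blows the gradient up exponentially, which is the exploding-gradient side of the problem.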

New architectures based on RNNs were invented, such as the LSTM (Long Short-Term Memory) and the Gated Recurrent Unit (GRU), which are modified versions of RNNs designed to address this problem.

References

- http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling