# Only Numpy: Dilated Back Propagation and Google Brain’s Gradient Noise with Interactive Code

So yesterday I found the paper “Dilated Recurrent Neural Networks” from NIPS 2017 and implemented it here. But then something hit me: ResNet and Highway Networks are built in a way that allows a direct connection between the input data X and the transformed data X`.

Why can’t we do the exact same thing for back propagation as well? Connect the gradient from one layer to layers that are not directly connected to it.

I mean, if you are not using a framework to perform automatic differentiation, why not connect the gradient from the latest layer to layers further back and just see how that goes? In this post we’ll do exactly that. Let’s also go one step further and compare against a model that applies Google Brain’s Gradient Noise.

Since I got the inspiration after reading the Dilated RNN paper, I’ll just call this Dilated Back Propagation. However, if anyone knows of other papers that perform back propagation in this fashion, please let me know in the comment section. I will also assume you have already read my blog post about implementing Dilated RNN; if not, please click here.

Network Architecture (Feed Forward Direction)

Red Circle → Final Output of the Network, a (1*10) vector holding the one-hot encoding of the predicted number

Brown Circle → Hidden State 0 for Layer 2
Lime Circle → Hidden States for Layer 2

Pink Circle → Hidden State 0 for Layer 1
Black Circle → Hidden State for Layer 1

Blue Numbers 1,2,3,4,5 → Input for each Time Stamp (Please TAKE NOTE of this since I am going to use this knowledge to explain the Training / Test Data)

Pinkish? Arrow → Direction of Feed Forward Operation

As seen above, the network architecture is exactly the same as in the previous post. However, there is one thing I changed: the input data for each time stamp.

Training Data / Test Data

Pink Box → Input at Time Stamp 1 (Vectorized 14*14 Pixel Image)
Yellow Box → Input at Time Stamp 2 (Vectorized 14*14 Pixel Image)
Blue Box → Input at Time Stamp 3 (Vectorized 14*14 Pixel Image)
Purple Box → Input at Time Stamp 4 (Vectorized 14*14 Pixel Image)
Green Box → Input at Time Stamp 5 (Vectorized 14*14 Pixel Image)

Despite some images looking bigger than others, all of them are (14*14) pixel images. Each image is made by applying a different pooling operation to the original (28*28) pixel image. The pooling operations are described below.

Pink Box → Mean Pooling Using np.mean function
Yellow Box → Variance Pooling Using np.var function
Blue Box → Max Pooling Using np.max function
Purple Box → Standard Deviation Pooling Using np.std function
Green Box → Median Pooling Using np.median function

Below is the code for achieving this.
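The original code was embedded as a screenshot, so here is a minimal sketch of the five pooling operations rather than the exact code from the post. It assumes non-overlapping 2*2 windows (which is what takes 28*28 down to 14*14), and the helper names `pool_2x2` and `make_time_stamp_inputs` are made up for illustration.

```python
import numpy as np

def pool_2x2(image, reducer):
    """Reduce an (H, W) image to (H/2, W/2) by applying `reducer` to each 2x2 block."""
    h, w = image.shape
    # Split each axis into (blocks, 2) so that axes 1 and 3 index inside one block.
    blocks = image.reshape(h // 2, 2, w // 2, 2)
    return reducer(blocks, axis=(1, 3))

# One pooling function per time stamp, in the order listed above:
# mean, variance, max, standard deviation, median.
POOLERS = [np.mean, np.var, np.max, np.std, np.median]

def make_time_stamp_inputs(image):
    """Return a (5, 196) array: five vectorized 14x14 views of one 28x28 image."""
    return np.stack([pool_2x2(image, f).ravel() for f in POOLERS])

rng = np.random.default_rng(0)
image = rng.random((28, 28))      # random stand-in for one 28x28 digit image
inputs = make_time_stamp_inputs(image)
print(inputs.shape)               # (5, 196)
```

Row `i` of `inputs` would then be the input at time stamp `i + 1`. The reshape trick only works when both image sides are even, which holds for 28*28.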

With that in mind, let’s take a look at other training data. Finally, the reason I did this is, simply put, that I wanted to.

Case 1: Normal Back Propagation

Purple Arrow → Standard Direction of Gradient Flow

Above is normal (or standard) back propagation, where we compute the gradient in each layer and pass it on to the next layer. The weights at each time stamp use it for their update, and the gradient flow continues on.
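To make the standard flow concrete, here is a minimal sketch of back propagation through time on a tiny single-layer tanh RNN. This is not the two-layer network from the figure: the sizes, the squared-error loss on the last hidden state, the weights shared across time stamps, and the names `forward` and `backward` are all assumptions chosen to keep the example short.

```python
import numpy as np

def forward(xs, w_x, w_h):
    """Run a tanh RNN over the time stamps; return hidden states h_0..h_T."""
    hs = [np.zeros((1, w_h.shape[0]))]
    for x in xs:
        hs.append(np.tanh(x @ w_x + hs[-1] @ w_h))
    return hs

def backward(xs, hs, w_h, grad_last):
    """Standard back propagation through time, starting from the gradient
    of the loss with respect to the last hidden state."""
    grad_w_x = np.zeros((xs[0].shape[1], w_h.shape[0]))
    grad_w_h = np.zeros_like(w_h)
    grad_h = grad_last
    for t in reversed(range(len(xs))):
        grad_pre = grad_h * (1.0 - hs[t + 1] ** 2)  # back through tanh
        grad_w_x += xs[t].T @ grad_pre              # weight gradient at this time stamp
        grad_w_h += hs[t].T @ grad_pre
        grad_h = grad_pre @ w_h.T                   # pass on to the previous time stamp
    return grad_w_x, grad_w_h

rng = np.random.default_rng(0)
xs = [rng.normal(size=(1, 4)) for _ in range(5)]    # five time stamps
w_x = rng.normal(scale=0.5, size=(4, 3))
w_h = rng.normal(scale=0.5, size=(3, 3))
target = rng.normal(size=(1, 3))

hs = forward(xs, w_x, w_h)
# Loss is 0.5 * sum((h_T - target)^2), so its gradient at h_T is (h_T - target).
grad_w_x, grad_w_h = backward(xs, hs, w_h, hs[-1] - target)
w_x -= 0.1 * grad_w_x                               # 0.1 is an arbitrary learning rate
w_h -= 0.1 * grad_w_h
```

The only path a gradient can take here is from one time stamp to the one directly before it, which is what the purple arrows show.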

Case 2: Google Brain’s Gradient Noise

Purple Arrow → Standard Direction of Gradient Flow

Again, the purple arrow represents the standard gradient flow. However, this time, before updating each weight, we are going to add some noise to the gradient. Below is a screenshot of how we can achieve this.
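Since the screenshot is an image, here is a minimal sketch of the noise step. It follows the annealed schedule from the Gradient Noise paper (reference 2), where zero-mean Gaussian noise with variance η / (1 + t)^γ is added to the gradient at training step t, with γ = 0.55 and η chosen from {0.01, 0.3, 1.0}. The function names, the placeholder gradient, and the learning rate are assumptions for illustration and may differ from the code in the post.

```python
import numpy as np

def gradient_noise_std(step, eta=0.01, gamma=0.55):
    """Standard deviation of the annealed noise: sigma_t^2 = eta / (1 + t)^gamma."""
    return np.sqrt(eta / (1.0 + step) ** gamma)

def add_gradient_noise(grad, step, rng, eta=0.01, gamma=0.55):
    """Return grad plus zero-mean Gaussian noise whose variance decays with step."""
    std = gradient_noise_std(step, eta, gamma)
    return grad + rng.normal(loc=0.0, scale=std, size=grad.shape)

rng = np.random.default_rng(0)
w = np.zeros((3, 3))
grad = np.ones((3, 3))      # placeholder gradient, normally from back propagation
learning_rate = 0.1         # hypothetical value
for step in range(3):
    w -= learning_rate * add_gradient_noise(grad, step, rng)
```

Because the variance shrinks as `step` grows, the noise matters most early in training and fades out later.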

Case 3: Dilated Back Propagation

Purple Arrow → Standard Direction of Gradient Flow
Black Arrows → Dilated Back Propagation where we pass on some portion of the gradient to the previous layers, which are not directly connected.

Now, here we introduce our new theory in the hope of improving the accuracy of the model. There are two things to note here.

1. We are only going to pass on some portion of the gradient to the previous layers.

As seen above, we have a variable called ‘decay proportion rate’, and we use an inverse time decay rate to decrease the amount of gradient that can pass on to the previous layers as time goes on. As seen in the Green Box, since we multiply the gradients from future layers by the decay proportion rate, the amount of dilated gradient flow decreases as training continues.

2. The Dilated Gradient Flow skips every 2 Layers.

As seen above in the Red Box, the gradient at time stamp 5 ONLY goes to the gradient at time stamp 3. However, this architecture can be explored further to make the gradient flow much denser.
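The two points above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the screenshots: `inverse_time_decay` uses the form initial / (1 + decay_rate * step), `backprop_step` is a hypothetical callback standing in for one standard backward step from time stamp t + 1 to t, and the gradients at different time stamps are assumed to have the same shape so they can be added.

```python
import numpy as np

def inverse_time_decay(initial, step, decay_rate):
    """initial / (1 + decay_rate * step): shrinks as training goes on."""
    return initial / (1.0 + decay_rate * step)

def dilated_backward(grad_last, backprop_step, n_steps, proportion, skip=2):
    """Walk back over the time stamps. Each gradient is the standard one from
    the next time stamp plus `proportion` times the one from `skip` stamps later."""
    grads = {n_steps: grad_last}                        # gradient at each time stamp
    for t in range(n_steps - 1, 0, -1):
        grad = backprop_step(grads[t + 1], t)           # standard flow: t+1 -> t
        if t + skip in grads:
            grad = grad + proportion * grads[t + skip]  # dilated flow: t+2 -> t
        grads[t] = grad
    return grads

grad_last = np.ones((1, 3))
halve = lambda grad, t: 0.5 * grad   # hypothetical stand-in for a real backward step

for iteration in (0, 100):
    p = inverse_time_decay(initial=0.1, step=iteration, decay_rate=0.05)
    grads = dilated_backward(grad_last, halve, n_steps=5, proportion=p)
    print(iteration, p, grads[3][0, 0])
```

With `skip=2`, time stamp 3 receives the standard gradient from time stamp 4 plus a scaled copy from time stamp 5, and the scaled part gets smaller at later iterations. Setting `proportion=0` recovers standard back propagation.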

Case 4: Dilated Back Propagation + Google Brain’s Gradient Noise

Purple Arrow → Standard Direction of Gradient Flow
Black Arrows → Dilated Back Propagation where we pass on some portion of the gradient to the previous layers, which are not directly connected.

Here we not only add gradient noise to each weight update, but also use the dilated gradient flow from Case 3.

Training and Results (Google Colab, Local Setting)

Above are the results from running the code on Google Colab. The accuracy bar represents the model’s correct guesses for 100 test images. Unfortunately I forgot to print out the exact accuracy rate, but we can see that Case 2 (Google Brain Gradient Noise) had the highest accuracy. Also, the cases with non-standard back propagation performed better than standard back propagation. In the cost-over-time plot, we can see that standard back propagation had the highest cost.

Above are the results from running the code on my local laptop. The accuracy bar represents the model’s correct guesses for 100 test images. Here it was interesting to see Case 3 (Dilated Back Propagation) underperforming compared to standard back propagation. However, the combination of Dilated Back Propagation and Google Brain’s Gradient Noise outperformed every other model.
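For anyone who wants the exact number behind the accuracy bars, here is a minimal sketch of counting correct guesses over 100 test images. It assumes the predictions and labels are stored as (100, 10) arrays, one row per image, and the random arrays below are only stand-ins for real network outputs. The name `count_correct` is made up for this example.

```python
import numpy as np

def count_correct(predictions, labels):
    """Count rows where the largest of the 10 outputs matches the one-hot label."""
    return int(np.sum(np.argmax(predictions, axis=1) == np.argmax(labels, axis=1)))

rng = np.random.default_rng(0)
labels = np.eye(10)[rng.integers(0, 10, size=100)]  # 100 one-hot test labels
predictions = rng.random((100, 10))                 # stand-in for network outputs
correct = count_correct(predictions, labels)
print(f"{correct} / 100 correct ({correct / 100:.2%})")
```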

Interactive Code

I moved to Google Colab for interactive code! You will need a Google account to view the code. Also, you can’t run read-only scripts in Google Colab, so make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy Coding!

Final Words

I love frameworks such as TensorFlow and Keras. However, I strongly believe we need to explore more ways to perform back propagation.

If any errors are found, please email me at jae.duk.seo@gmail.com. If you wish to see the list of all of my writing, please view my website here.

Meanwhile, follow me on my Twitter here, and visit my website or my YouTube channel for more content. I also did a comparison of Decoupled Neural Network here, if you are interested.

Reference

1. Chang, S., Zhang, Y., Han, W., Yu, M., Guo, X., Tan, W., … & Huang, T. S. (2017). Dilated recurrent neural networks. In Advances in Neural Information Processing Systems (pp. 76–86).
2. Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., & Martens, J. (2015). Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807.
3. Seo, J. D. (2018, February 14). Only Numpy: NIPS 2017 — Implementing Dilated Recurrent Neural Networks with Interactive Code. Retrieved February 15, 2018, from https://towardsdatascience.com/only-numpy-nips-2017-implementing-dilated-recurrent-neural-networks-with-interactive-code-e83abe8c9b27
4. Index. (n.d.). Retrieved February 15, 2018, from https://docs.scipy.org/doc/numpy/genindex.html
5. “tf.train.inverse_time_decay | TensorFlow”, TensorFlow, 2018. [Online]. Available: https://www.tensorflow.org/api_docs/python/tf/train/inverse_time_decay. [Accessed: 16- Feb- 2018].