Backpropagation Through Time

This post walks through the step-by-step derivation of backpropagation through time (BPTT) and a NumPy implementation.

Suppose we have a many-to-many RNN and we are using the square loss. The total loss is the sum of the squared losses over all timesteps.
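As a quick sanity check, the total loss for a toy sequence can be computed directly. The prediction and target values below are made up for illustration, not taken from the model in this post:

import numpy as np

# Hypothetical per-timestep predictions and targets for T = 3 timesteps
# (the values are illustrative only).
preds   = [np.array([[0.9]]), np.array([[0.2]]), np.array([[0.7]])]
targets = [np.array([[1.0]]), np.array([[0.0]]), np.array([[1.0]])]

# total loss = sum over timesteps of the square loss 0.5 * (y_pred - y)^2
total_loss = sum(0.5 * ((p - t) ** 2).item() for p, t in zip(preds, targets))
print(total_loss)  # 0.5 * (0.01 + 0.04 + 0.09) ≈ 0.07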

The gradients in green boxes are used to update the parameters.

If you are using cross-entropy loss instead, which is usually paired with a softmax output, you need to use the derivative of softmax (shown above) instead; all the steps after that are analogous to the square-loss case.
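For reference, here is a small sketch of what the output-layer delta looks like in the softmax-plus-cross-entropy case (the function and variable names are mine, not from the post): the sigmoid-derivative factor disappears, and the delta is simply the predicted probabilities minus the one-hot label.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

logits  = np.array([2.0, 1.0, 0.1])   # illustrative output-layer pre-activations
one_hot = np.array([1.0, 0.0, 0.0])   # true class is index 0
probs   = softmax(logits)

# derivative of cross-entropy w.r.t. the logits when the output is softmax:
k_delta = probs - one_hot

You can verify this formula numerically by comparing k_delta against a central-difference estimate of the loss -log(probs[true_class]).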

In a many-to-one RNN, things are even simpler, because the gradient for the hidden unit vector at every timestep except the last only consists of the error propagated from future timesteps (unlike the many-to-many case, which also has an error from the current timestep). Other than that, the derivations are similar.

Now, we can implement BPTT in NumPy on a toy example.

First, let’s try the many-to-many model. We use a many-to-many RNN to simulate binary addition. At each timestep, two binary digits are fed in as input, and the corresponding digit of the correct sum is used as the label. We use sigmoid activation for both the hidden and output layers, so the predicted output value lies between 0 and 1.

Implementation: (following my BPTT steps for many-to-many above)

import copy, numpy as np
np.random.seed(0)

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)

# training dataset generation
int2binary = {}
binary_dim = 8
largest_number = pow(2,binary_dim)
binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T,axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]

# input variables
alpha = 0.1
# set dx
input_dim = 2
# set da
hidden_dim = 16
# set dy
output_dim = 1

# initialize neural network weights
Wa = 2*np.random.random((input_dim,hidden_dim)) - 1
ba = np.zeros((1, hidden_dim))
Wy = 2*np.random.random((hidden_dim,output_dim)) - 1
by = np.zeros((1, output_dim))
Wh = 2*np.random.random((hidden_dim,hidden_dim)) - 1
Wa_update = np.zeros_like(Wa)
ba_update = np.zeros_like(ba)
Wy_update = np.zeros_like(Wy)
by_update = np.zeros_like(by)
Wh_update = np.zeros_like(Wh)

# training
for j in range(10000):

    # generate a simple addition problem (a + b = c)
    a_int = np.random.randint(largest_number/2) # int version
    a = int2binary[a_int] # binary encoding
    b_int = np.random.randint(largest_number/2) # int version
    b = int2binary[b_int] # binary encoding
    c_int = a_int + b_int
    c = int2binary[c_int]

    # where we'll store our best guess (binary encoded)
    d = np.zeros_like(c)
    overallError = 0

    # list for k_deltas
    k_deltas = list()
    # list for a values
    a_values = list()
    # a<0> initialized as 0 vector
    a_values.append(np.zeros(hidden_dim))

    # moving along the positions in the binary encoding
    for position in range(binary_dim):

        # generate input and output
        # input binary is processed in reverse order
        X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T
        # hidden layer (input + prev_hidden)
        a_t = sigmoid(np.dot(X,Wa) + ba + np.dot(a_values[-1],Wh))
        # output layer (new binary representation)
        pred_yt = sigmoid(np.dot(a_t,Wy) + by)
        # did we miss?... if so, by how much?
        # loss function: 0.5*(y_pred - y)**2
        pred_yt_error = pred_yt - y
        k_deltas.append((pred_yt_error)*sigmoid_output_to_derivative(pred_yt))
        overallError += (0.5*pred_yt_error[0]**2)

        # store decoded predicted digit
        d[binary_dim - position - 1] = np.round(pred_yt[0][0])

        # store hidden layer so we can use it in the next timestep
        a_values.append(copy.deepcopy(a_t))

    # a<T+1>_delta initialized as 0 vector
    future_ht_delta = np.zeros(hidden_dim)

    for position in range(binary_dim):
        # backpropagate from the last timestep
        X = np.array([[a[position],b[position]]])
        at = a_values[-position-1]
        # get a<t-1> for later use
        prev_at = a_values[-position-2]

        # error at output layer
        kt_delta = k_deltas[-position-1]
        # error at hidden layer
        ht_delta = (kt_delta.dot(Wy.T) + future_ht_delta.dot(Wh.T)) * sigmoid_output_to_derivative(at)
        # accumulate the weight gradients
        Wy_update += np.atleast_2d(at).T.dot(kt_delta)
        by_update += kt_delta
        Wh_update += np.atleast_2d(prev_at).T.dot(ht_delta)
        Wa_update += X.T.dot(ht_delta)
        ba_update += ht_delta

        future_ht_delta = ht_delta

    Wy -= Wy_update * alpha
    by -= by_update * alpha
    Wh -= Wh_update * alpha
    Wa -= Wa_update * alpha
    ba -= ba_update * alpha
    # reset to 0 for next example
    Wy_update *= 0
    by_update *= 0
    Wh_update *= 0
    Wa_update *= 0
    ba_update *= 0

    # print out progress
    if(j % 1000 == 0):
        print ("Error:" + str(overallError))
        print ("Pred:" + str(d))
        print ("True:" + str(c))
        out = 0
        for index,x in enumerate(reversed(d)):
            out += x*pow(2,index)
        print (str(a_int) + " + " + str(b_int) + " = " + str(out))
        print ("------------")

Sample output: (you can reproduce the result by setting the same random seed)

Error:[0.81597212]
Pred:[0 0 0 0 0 0 0 1]
True:[0 1 0 0 0 1 0 1]
9 + 60 = 1
------------
Error:[0.8360905]
Pred:[1 1 1 1 1 1 1 1]
True:[0 0 1 1 1 1 1 1]
28 + 35 = 255
------------
Error:[0.96656603]
Pred:[0 1 0 0 1 0 0 0]
True:[1 0 1 0 0 0 0 0]
116 + 44 = 72
------------
Error:[0.83946455]
Pred:[1 1 0 1 1 1 1 1]
True:[0 1 0 0 1 1 0 1]
4 + 73 = 223
------------
Error:[0.85176301]
Pred:[0 1 0 0 1 0 0 0]
True:[0 1 0 1 0 0 1 0]
71 + 11 = 72
------------
Error:[0.50600772]
Pred:[1 0 1 0 0 0 1 0]
True:[1 1 0 0 0 0 1 0]
81 + 113 = 162
------------
Error:[0.00943078]
Pred:[0 1 0 1 0 0 0 1]
True:[0 1 0 1 0 0 0 1]
81 + 0 = 81
------------
Error:[0.03681629]
Pred:[1 0 0 0 0 0 0 1]
True:[1 0 0 0 0 0 0 1]
4 + 125 = 129
------------
Error:[0.00784794]
Pred:[0 0 1 1 1 0 0 0]
True:[0 0 1 1 1 0 0 0]
39 + 17 = 56
------------
Error:[0.00440828]
Pred:[0 0 0 0 1 1 1 0]
True:[0 0 0 0 1 1 1 0]
11 + 3 = 14
------------

From the output, you can see that the model is learning how to do binary number addition.
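If you want to convince yourself that the backward pass above computes correct gradients, a finite-difference check on a tiny RNN is a quick test. The sketch below is mine, not part of the original implementation: the dimensions and data are made up, the variable names mirror the code above, and it compares the BPTT gradient of Wh against a central-difference estimate.

import numpy as np

rng = np.random.default_rng(0)
sig = lambda x: 1 / (1 + np.exp(-x))

# tiny many-to-many RNN: 2 inputs, 3 hidden units, 1 output, 2 timesteps
Dx, Dh, T = 2, 3, 2
Wa = rng.normal(size=(Dx, Dh)); ba = np.zeros((1, Dh))
Wh = rng.normal(size=(Dh, Dh))
Wy = rng.normal(size=(Dh, 1)); by = np.zeros((1, 1))
Xs = [rng.normal(size=(1, Dx)) for _ in range(T)]
ys = [np.array([[1.0]]), np.array([[0.0]])]

def forward(Wh):
    # run the RNN forward; return total square loss, hidden states, predictions
    a_prev = np.zeros((1, Dh))
    a_vals, preds, loss = [a_prev], [], 0.0
    for X, y in zip(Xs, ys):
        a_prev = sig(X @ Wa + ba + a_prev @ Wh)
        a_vals.append(a_prev)
        p = sig(a_prev @ Wy + by)
        preds.append(p)
        loss += 0.5 * ((p - y) ** 2).item()
    return loss, a_vals, preds

loss, a_vals, preds = forward(Wh)

# BPTT gradient for Wh, following the many-to-many backward pass above
gWh = np.zeros_like(Wh)
future_ht_delta = np.zeros((1, Dh))
for t in reversed(range(T)):
    kt_delta = (preds[t] - ys[t]) * preds[t] * (1 - preds[t])
    ht_delta = (kt_delta @ Wy.T + future_ht_delta @ Wh.T) * a_vals[t + 1] * (1 - a_vals[t + 1])
    gWh += a_vals[t].T @ ht_delta
    future_ht_delta = ht_delta

# central-difference estimate for one entry of Wh
eps = 1e-6
Wh_p = Wh.copy(); Wh_p[0, 0] += eps
Wh_m = Wh.copy(); Wh_m[0, 0] -= eps
numeric = (forward(Wh_p)[0] - forward(Wh_m)[0]) / (2 * eps)
print(abs(numeric - gWh[0, 0]))  # should be close to 0

The two gradients should agree to many decimal places; a large discrepancy would indicate a sign or indexing error in the backward pass.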

Next, we modify the task a bit to apply the many-to-one model. This time, we still use the binary digits as input, but instead of predicting a digit of the resulting binary number at every timestep, we directly predict the base-10 (decimal) result at the end. Since we are directly predicting a decimal value, we no longer use a sigmoid activation at the output. This changes the formula for kT_delta (no sigmoid derivative anymore), and the output error enters the hidden-layer delta only at the last timestep, which is reflected in my code below. Note that this time, we only update Wy and by at the last timestep.

Implementation: (following my BPTT steps for many-to-one above)

import copy, numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)

# training dataset generation
int2binary = {}
binary_dim = 8
largest_number = pow(2,binary_dim)
binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T,axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]

# input variables
alpha = 0.1
# set dx
input_dim = 2
# set da
hidden_dim = 16
# set dy
output_dim = 1
errors = []

# initialize neural network weights
Wa = 2*np.random.random((input_dim,hidden_dim)) - 1
ba = np.zeros((1, hidden_dim))
Wy = 2*np.random.random((hidden_dim,output_dim)) - 1
by = np.zeros((1, output_dim))
Wh = 2*np.random.random((hidden_dim,hidden_dim)) - 1
Wa_update = np.zeros_like(Wa)
ba_update = np.zeros_like(ba)
Wy_update = np.zeros_like(Wy)
by_update = np.zeros_like(by)
Wh_update = np.zeros_like(Wh)

# training
for j in range(100000):

    # generate a simple addition problem (a + b = c)
    a_int = np.random.randint(largest_number/2) # int version
    a = int2binary[a_int] # binary encoding
    b_int = np.random.randint(largest_number/2) # int version
    b = int2binary[b_int] # binary encoding
    c_int = a_int + b_int

    overallError = 0

    # list for a values
    a_values = list()
    # a<0> initialized as 0 vector
    a_values.append(np.zeros(hidden_dim))

    # moving along the positions in the binary encoding
    for position in range(binary_dim):

        # generate input and output
        # input binary is processed in reverse order
        X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
        y = c_int
        # hidden layer (input + prev_hidden)
        a_t = sigmoid(np.dot(X,Wa) + ba + np.dot(a_values[-1],Wh))
        # output layer
        # only the last timestep has output,
        # and only the last timestep can update Wy, by
        if position == binary_dim-1:
            pred_yt = np.dot(a_t,Wy) + by
            # loss function: 0.5*(y_pred - y)**2
            # no more sigmoid
            pred_yt_error = pred_yt[0][0] - y

            kT_delta = np.array([[pred_yt_error]])
            overallError = (0.5*pred_yt_error**2)

            # store decoded predicted digit
            final_pred = np.round(pred_yt[0][0])

        # store hidden layer so we can use it in the next timestep
        a_values.append(copy.deepcopy(a_t))

    # a<T+1>_delta initialized as 0 vector
    future_ht_delta = np.zeros(hidden_dim)

    for position in range(binary_dim):
        # backpropagate from the last timestep
        X = np.array([[a[position],b[position]]])
        at = a_values[-position-1]
        # get a<t-1> for later use
        prev_at = a_values[-position-2]

        # error at hidden layer
        # the last timestep (position == 0 here) also receives the error
        # from the output layer; earlier timesteps only receive the error
        # propagated back from future timesteps
        if position == 0:
            ht_delta = (kT_delta.dot(Wy.T) + future_ht_delta.dot(Wh.T)) * sigmoid_output_to_derivative(at)
            Wy_update = np.atleast_2d(at).T.dot(kT_delta)
            by_update = kT_delta
        else:
            ht_delta = (future_ht_delta.dot(Wh.T)) * sigmoid_output_to_derivative(at)
        # accumulate the weight gradients
        Wh_update += np.atleast_2d(prev_at).T.dot(ht_delta)
        Wa_update += X.T.dot(ht_delta)
        ba_update += ht_delta

        future_ht_delta = ht_delta

    Wy -= Wy_update * alpha
    by -= by_update * alpha
    Wh -= Wh_update * alpha
    Wa -= Wa_update * alpha
    ba -= ba_update * alpha
    # reset to 0 for next example
    Wy_update *= 0
    by_update *= 0
    Wh_update *= 0
    Wa_update *= 0
    ba_update *= 0

    # print out progress
    if(j % 1000 == 0):
        print ("Error:" + str(overallError))
        errors.append(overallError)
        # print ("Pred:" + str(final_pred))
        # print ("True:" + str(c_int))
        # print (str(a_int) + " + " + str(b_int) + " = " + str(final_pred))
        # print ("------------")

plt.plot(errors)
plt.xlabel('iterations (x1000)')
plt.ylabel('loss')
plt.show()

I also plotted the loss curve, and you can see that the loss is indeed decreasing.

Sample output:

Pred:-1.0
True:69
9 + 60 = -1.0
------------
Pred:114.0
True:95
18 + 77 = 114.0
------------
Pred:128.0
True:155
39 + 116 = 128.0
------------
Pred:173.0
True:170
95 + 75 = 173.0
------------
Pred:116.0
True:80
67 + 13 = 116.0
------------
Pred:192.0
True:181
99 + 82 = 192.0
------------
Pred:149.0
True:147
46 + 101 = 149.0
------------
Pred:130.0
True:112
5 + 107 = 130.0
------------
Pred:198.0
True:182
114 + 68 = 198.0
------------
Pred:163.0
True:200
109 + 91 = 163.0

From the output, you may notice that it is harder to train the RNN to add binary numbers and directly produce the decimal result. This is because, in this case, the model not only needs to learn to add two binary numbers but also needs to learn how to convert the binary sum into a decimal value, whereas in the previous many-to-many example the RNN only needs to learn per-digit binary addition, which is easier.
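To make that extra burden concrete: the decimal answer is a weighted sum of the output bits, decimal = sum_i bit_i * 2^i, and the many-to-one network has to approximate this weighting through its recurrent state on top of the carry logic. A quick check of that decoding (the value 69 is arbitrary):

import numpy as np

bits = np.unpackbits(np.array([69], dtype=np.uint8))  # MSB first: [0 1 0 0 0 1 0 1]
# weight each bit by its power of two, least-significant bit first
decimal = int(sum(int(b) * 2 ** i for i, b in enumerate(reversed(bits))))
print(decimal)  # 69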