# Backpropagation Through Time

This post shows a step-by-step derivation of backpropagation through time (BPTT), along with an implementation in NumPy.

Suppose we have a many-to-many RNN trained with the square loss. The total loss is the sum of the per-timestep square losses over all timesteps.
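As a toy illustration (the per-timestep predictions and labels here are made-up values, not from the model below), the total loss is computed like this:

```python
import numpy as np

# hypothetical per-timestep predictions and labels for a 3-step sequence
y_preds = np.array([0.9, 0.2, 0.7])
y_trues = np.array([1.0, 0.0, 1.0])

# square loss at each timestep: 0.5 * (y_pred - y_true)**2
per_step_loss = 0.5 * (y_preds - y_trues) ** 2

# total loss is the sum over all timesteps
total_loss = per_step_loss.sum()
```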

The gradients in green boxes are used to update the parameters.

If you are using cross-entropy loss instead, which is usually paired with a softmax output, you need to use the derivative of softmax (shown above) instead; all the steps after that are the same as with the square loss.
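For reference, here is a minimal sketch of that case (illustrative names and values, not part of the post's code): with cross-entropy loss L = -Σ y·log(p) and p = softmax(z), the gradient with respect to the logits z collapses to p - y, after which backpropagation proceeds as before.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# toy logits and a one-hot label (illustrative values)
z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])

p = softmax(z)
# gradient of cross-entropy w.r.t. the logits simplifies to p - y
grad_z = p - y
```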

In a many-to-one RNN, things are even simpler: for every timestep before the last, the gradient for the hidden unit vector consists only of the error propagated from future steps (unlike many-to-many, which also has an error from the current timestep's output); at the final timestep, the output error enters the hidden state as usual. Other than that, the derivation is similar.
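To make the contrast concrete, here is a small sketch of the hidden-layer error at a timestep t < T (toy shapes and random values; the variable names mirror the code below but the numbers are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
hidden_dim, output_dim = 4, 1
Wy = rng.randn(hidden_dim, output_dim)             # hidden-to-output weights
Wh = rng.randn(hidden_dim, hidden_dim)             # hidden-to-hidden weights
a_t = 1 / (1 + np.exp(-rng.randn(1, hidden_dim)))  # sigmoid hidden activations
kt_delta = rng.randn(1, output_dim)                # error at the output layer
future_ht_delta = rng.randn(1, hidden_dim)         # error arriving from step t+1

# many-to-many: current output error plus error from future steps
mm_delta = (kt_delta.dot(Wy.T) + future_ht_delta.dot(Wh.T)) * a_t * (1 - a_t)

# many-to-one (t < T): no output at this step, only the future term remains
mo_delta = future_ht_delta.dot(Wh.T) * a_t * (1 - a_t)
```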

Now we can implement BPTT with NumPy on a toy example.

First, let’s try the many-to-many model. We use a many-to-many RNN to simulate binary addition. At each step, two binary digits are fed as input, and the corresponding digit of the correct sum is used as the label. We use sigmoid activations for both the hidden and output layers, so the predicted output value lies between 0 and 1.

Implementation: (following my BPTT steps for many-to-many above)

```python
import copy, numpy as np
np.random.seed(0)

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)

# training dataset generation
int2binary = {}
binary_dim = 8

largest_number = pow(2,binary_dim)
binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T,axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]

# input variables
alpha = 0.1
# set dx
input_dim = 2
# set da
hidden_dim = 16
# set dy
output_dim = 1

# initialize neural network weights
Wa = 2*np.random.random((input_dim,hidden_dim)) - 1
ba = np.zeros((1, hidden_dim))
Wy = 2*np.random.random((hidden_dim,output_dim)) - 1
by = np.zeros((1, output_dim))
Wh = 2*np.random.random((hidden_dim,hidden_dim)) - 1

Wa_update = np.zeros_like(Wa)
ba_update = np.zeros_like(ba)
Wy_update = np.zeros_like(Wy)
by_update = np.zeros_like(by)
Wh_update = np.zeros_like(Wh)

# training
for j in range(10000):

    # generate a simple addition problem (a + b = c)
    a_int = np.random.randint(largest_number/2) # int version
    a = int2binary[a_int] # binary encoding

    b_int = np.random.randint(largest_number/2) # int version
    b = int2binary[b_int] # binary encoding

    # true answer
    c_int = a_int + b_int
    c = int2binary[c_int]

    # where we'll store our best guess (binary encoded)
    d = np.zeros_like(c)

    overallError = 0

    # list for k_deltas
    k_deltas = list()
    # list for a values
    a_values = list()
    # a<0> initialized as 0 vector
    a_values.append(np.zeros(hidden_dim))

    # moving along the positions in the binary encoding
    for position in range(binary_dim):

        # generate input and output
        # input binary is processed in reverse order
        X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T

        # hidden layer (input ~+ prev_hidden)
        a_t = sigmoid(np.dot(X,Wa) + ba + np.dot(a_values[-1],Wh))

        # output layer (new binary representation)
        pred_yt = sigmoid(np.dot(a_t,Wy) + by)

        # did we miss?... if so, by how much?
        # loss function: 0.5*(y_pred - y)**2
        pred_yt_error = pred_yt - y
        k_deltas.append((pred_yt_error)*sigmoid_output_to_derivative(pred_yt))
        overallError += (0.5*pred_yt_error[0]**2)

        # store decoded predicted digit
        d[binary_dim - position - 1] = np.round(pred_yt[0][0])

        # store hidden layer so we can use it in the next timestep
        a_values.append(copy.deepcopy(a_t))

    # h<T+1>_delta initialized as 0 vector
    future_ht_delta = np.zeros(hidden_dim)

    for position in range(binary_dim):
        # backpropagate from the last timestep
        X = np.array([[a[position],b[position]]])
        at = a_values[-position-1]
        # get a<t-1> for later use
        prev_at = a_values[-position-2]

        # error at output layer
        kt_delta = k_deltas[-position-1]
        # error at hidden layer
        ht_delta = (kt_delta.dot(Wy.T) + future_ht_delta.dot(Wh.T)) * sigmoid_output_to_derivative(at)

        # let's update all our weights so we can try again
        Wy_update += np.atleast_2d(at).T.dot(kt_delta)
        by_update += kt_delta
        Wh_update += np.atleast_2d(prev_at).T.dot(ht_delta)
        Wa_update += X.T.dot(ht_delta)
        ba_update += ht_delta

        future_ht_delta = ht_delta

    # gradient descent
    Wy -= Wy_update * alpha
    by -= by_update * alpha
    Wh -= Wh_update * alpha
    Wa -= Wa_update * alpha
    ba -= ba_update * alpha

    # reset to 0 for next example
    Wy_update *= 0
    by_update *= 0
    Wh_update *= 0
    Wa_update *= 0
    ba_update *= 0

    # print out progress
    if(j % 1000 == 0):
        print("Error:" + str(overallError))
        print("Pred:" + str(d))
        print("True:" + str(c))
        out = 0
        for index,x in enumerate(reversed(d)):
            out += x*pow(2,index)
        print(str(a_int) + " + " + str(b_int) + " = " + str(out))
        print("------------")
```

Sample output: (you can reproduce the result by setting the same random seed)

```
Error:[0.81597212]
Pred:[0 0 0 0 0 0 0 1]
True:[0 1 0 0 0 1 0 1]
9 + 60 = 1
------------
Error:[0.8360905]
Pred:[1 1 1 1 1 1 1 1]
True:[0 0 1 1 1 1 1 1]
28 + 35 = 255
------------
Error:[0.96656603]
Pred:[0 1 0 0 1 0 0 0]
True:[1 0 1 0 0 0 0 0]
116 + 44 = 72
------------
Error:[0.83946455]
Pred:[1 1 0 1 1 1 1 1]
True:[0 1 0 0 1 1 0 1]
4 + 73 = 223
------------
Error:[0.85176301]
Pred:[0 1 0 0 1 0 0 0]
True:[0 1 0 1 0 0 1 0]
71 + 11 = 72
------------
Error:[0.50600772]
Pred:[1 0 1 0 0 0 1 0]
True:[1 1 0 0 0 0 1 0]
81 + 113 = 162
------------
Error:[0.00943078]
Pred:[0 1 0 1 0 0 0 1]
True:[0 1 0 1 0 0 0 1]
81 + 0 = 81
------------
Error:[0.03681629]
Pred:[1 0 0 0 0 0 0 1]
True:[1 0 0 0 0 0 0 1]
4 + 125 = 129
------------
Error:[0.00784794]
Pred:[0 0 1 1 1 0 0 0]
True:[0 0 1 1 1 0 0 0]
39 + 17 = 56
------------
Error:[0.00440828]
Pred:[0 0 0 0 1 1 1 0]
True:[0 0 0 0 1 1 1 0]
11 + 3 = 14
------------
```

From the output, you can see that the model is learning how to do binary number addition.

Next, we modify the task a bit to apply the many-to-one model. This time, we still use the binary digits as input, but instead of predicting one digit of the binary sum at every timestep, we directly predict the base-10 (decimal) result at the end. Since we are predicting a decimal value directly, we no longer use a sigmoid activation on the output. This only requires a minor change to the formula for kt_delta, which is reflected in the code below. Note that this time, we only update Wy and by at the last timestep.
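Concretely, dropping the output sigmoid removes the sigmoid-derivative factor from the output-layer delta, so it becomes just the raw error. A sketch with illustrative values:

```python
import numpy as np

def sigmoid_output_to_derivative(output):
    return output * (1 - output)

# illustrative prediction and label (made-up numbers)
pred_yt, y = 0.8, 1.0

# sigmoid output (many-to-many case): the chain rule includes sigmoid'
kt_delta_sigmoid = (pred_yt - y) * sigmoid_output_to_derivative(pred_yt)

# linear output (many-to-one case): the sigmoid term disappears
kt_delta_linear = pred_yt - y
```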

Implementation: (following my BPTT steps for many-to-one above)

```python
import copy, numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)

# training dataset generation
int2binary = {}
binary_dim = 8

largest_number = pow(2,binary_dim)
binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T,axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]

# input variables
alpha = 0.1
# set dx
input_dim = 2
# set da
hidden_dim = 16
# set dy
output_dim = 1

errors = []

# initialize neural network weights
Wa = 2*np.random.random((input_dim,hidden_dim)) - 1
ba = np.zeros((1, hidden_dim))
Wy = 2*np.random.random((hidden_dim,output_dim)) - 1
by = np.zeros((1, output_dim))
Wh = 2*np.random.random((hidden_dim,hidden_dim)) - 1

Wa_update = np.zeros_like(Wa)
ba_update = np.zeros_like(ba)
Wy_update = np.zeros_like(Wy)
by_update = np.zeros_like(by)
Wh_update = np.zeros_like(Wh)

# training
for j in range(100000):

    # generate a simple addition problem (a + b = c)
    a_int = np.random.randint(largest_number/2) # int version
    a = int2binary[a_int] # binary encoding

    b_int = np.random.randint(largest_number/2) # int version
    b = int2binary[b_int] # binary encoding

    # true answer
    c_int = a_int + b_int

    overallError = 0

    # list for a values
    a_values = list()
    # a<0> initialized as 0 vector
    a_values.append(np.zeros(hidden_dim))

    # moving along the positions in the binary encoding
    for position in range(binary_dim):

        # generate input and output
        # input binary is processed in reverse order
        X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
        y = c_int

        # hidden layer (input ~+ prev_hidden)
        a_t = sigmoid(np.dot(X,Wa) + ba + np.dot(a_values[-1],Wh))

        # output layer
        # only the last timestep has output,
        # and only the last timestep can update Wy, by
        if position == binary_dim-1:
            pred_yt = np.dot(a_t,Wy) + by

            # loss function: 0.5*(y_pred - y)**2
            # no more sigmoid
            pred_yt_error = pred_yt[0][0] - y

            kT_delta = np.array([[pred_yt_error]])
            overallError = (0.5*pred_yt_error**2)

            # store decoded predicted digit
            final_pred = np.round(pred_yt[0][0])

        # store hidden layer so we can use it in the next timestep
        a_values.append(copy.deepcopy(a_t))

    # h<T+1>_delta initialized as 0 vector
    future_ht_delta = np.zeros(hidden_dim)

    for position in range(binary_dim):
        # backpropagate from the last timestep
        X = np.array([[a[position],b[position]]])
        at = a_values[-position-1]
        # get a<t-1> for later use
        prev_at = a_values[-position-2]

        # error at hidden layer
        if position == 0:
            # last timestep: the output error also flows into the hidden state
            ht_delta = (kT_delta.dot(Wy.T) + future_ht_delta.dot(Wh.T)) * sigmoid_output_to_derivative(at)
        else:
            # earlier timesteps: only the error propagated from future steps
            ht_delta = (future_ht_delta.dot(Wh.T)) * sigmoid_output_to_derivative(at)

        # let's update all our weights so we can try again
        if position == 0:
            Wy_update = np.atleast_2d(at).T.dot(kT_delta)
            by_update = kT_delta
        Wh_update += np.atleast_2d(prev_at).T.dot(ht_delta)
        Wa_update += X.T.dot(ht_delta)
        ba_update += ht_delta

        future_ht_delta = ht_delta

    # gradient descent
    Wy -= Wy_update * alpha
    by -= by_update * alpha
    Wh -= Wh_update * alpha
    Wa -= Wa_update * alpha
    ba -= ba_update * alpha

    # reset to 0 for next example
    Wy_update *= 0
    by_update *= 0
    Wh_update *= 0
    Wa_update *= 0
    ba_update *= 0

    # print out progress
    if(j % 1000 == 0):
        print("Error:" + str(overallError))
        errors.append(overallError)
        # print("Pred:" + str(final_pred))
        # print("True:" + str(c_int))
        # print(str(a_int) + " + " + str(b_int) + " = " + str(final_pred))
        # print("------------")

plt.plot(errors)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.show()
```

I also plotted the loss curve, and you can see that the loss is indeed decreasing.

Sample output:

```
Pred:-1.0
True:69
9 + 60 = -1.0
------------
Pred:114.0
True:95
18 + 77 = 114.0
------------
Pred:128.0
True:155
39 + 116 = 128.0
------------
Pred:173.0
True:170
95 + 75 = 173.0
------------
Pred:116.0
True:80
67 + 13 = 116.0
------------
Pred:192.0
True:181
99 + 82 = 192.0
------------
Pred:149.0
True:147
46 + 101 = 149.0
------------
Pred:130.0
True:112
5 + 107 = 130.0
------------
Pred:198.0
True:182
114 + 68 = 198.0
------------
Pred:163.0
True:200
109 + 91 = 163.0
```

From the output, you may notice that it is harder to train the RNN to add binary numbers and directly produce the decimal result. This is because in this case the model not only needs to learn to add two binary numbers but also to convert the binary sum into a decimal value, whereas in the previous many-to-many example the RNN only needs to learn per-digit binary addition, which is easier.