Catastrophic Forgetting in Neural Networks

Source: Deep Learning on Medium

This article analyzes catastrophic forgetting using permuted MNIST and multi-layer perceptron (MLP) networks.

Deep learning approaches have led to many breakthroughs in various domains, yet they suffer from the credit-assignment and 'forgetting' problems. Deep learning systems have become more capable over time; however, a standard multi-layer perceptron (MLP) trained with traditional approaches cannot learn new tasks or categories incrementally without catastrophically forgetting previously learned training data. This problem is called catastrophic forgetting in neural networks. Fixing it is critical if agents are to learn and improve incrementally when deployed in real-life settings. In simple terms, if you train a model on Task A and then reuse the same weights to learn a new Task B, the model forgets what it learned about Task A; that is, it catastrophically forgets the previous information. In this article, we measure and analyze catastrophic forgetting in neural networks.


  1. We use permuted MNIST for this experiment.
  2. Train a network on Task A for 50 epochs to reach a desirable performance/accuracy.
  3. Test on Task A; then, using the same network, train on Task B for 20 epochs and test on both Task A and Task B. Continue this for N tasks (N = 10).
  4. Total number of epochs: 50 + 9 × 20 = 230.
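The protocol above can be sketched as follows. This is a minimal outline, not the article's actual code: the seed value is illustrative, and the commented `train`/`evaluate` calls stand in for the real TensorFlow training and evaluation routines.

```python
import numpy as np

N_TASKS = 10
EPOCHS = [50] + [20] * (N_TASKS - 1)  # 50 + 9 * 20 = 230 epochs in total

rng = np.random.RandomState(seed=42)  # seed value is illustrative
perms = [rng.permutation(784) for _ in range(N_TASKS)]  # one fixed pixel permutation per task

def make_task(images, perm):
    """Apply a task's fixed pixel permutation to flattened 28x28 images."""
    return images.reshape(-1, 784)[:, perm]

# Sequential schedule: train on each task in turn, then test on all tasks seen so far.
for t in range(N_TASKS):
    # train(model, make_task(x_train, perms[t]), y_train, epochs=EPOCHS[t])
    for seen in range(t + 1):
        pass  # evaluate(model, make_task(x_test, perms[seen]), y_test)
```

Because each task is just a fixed reshuffling of the same pixels, the tasks are equally hard for an MLP, which makes permuted MNIST a clean benchmark for forgetting.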


The coding was done in TensorFlow 1.14 using eager execution (Python 3.6) in a Unix environment. A unique seed (derived from a name and a random integer) is used to create 10 different permutations (Figure 1) of the MNIST dataset, and each permutation is treated as a different task. For each permutation,
the dataset has 49,500 training images, 10,000 testing images, and 5,500 validation images. We train three types of MLPs (2-layer, 3-layer, and 4-layer) on the permuted datasets. Each MLP is trained for 50 epochs on the first task and for 20 epochs on each of the subsequent 9 tasks (230 epochs in total). The aim is to minimize catastrophic forgetting: as an MLP trains on later tasks, it should retain weights that preserve performance on most of the tasks it was previously trained on.

Performance Metrics
Four different performance metrics, all derived from a resultant task matrix, are used to estimate the level of forgetting in the neural networks. Each is discussed below.

  1. Resultant Task Matrix

After the model finishes learning each task, its validation and test performance on every task is evaluated to form a resultant task matrix. We construct a matrix M where M_{i,j} is the classification accuracy of the model on task t_j after training on task t_i. Once the resultant matrix is built, we can use it to calculate the other performance metrics described below.
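In code, the resultant task matrix can be accumulated during the sequential run. Here is a minimal sketch in which `evaluate` is a dummy stand-in returning illustrative numbers; the real routine would run the trained model on each task's test set.

```python
import numpy as np

N_TASKS = 10
M = np.zeros((N_TASKS, N_TASKS))  # M[i, j]: accuracy on task j after training on task i

def evaluate(i, j):
    """Stand-in for the real routine that returns test accuracy on task j
    for the model state reached after training through task i (dummy values)."""
    return 0.9 if j == i else (0.6 if j < i else 0.1)

for i in range(N_TASKS):
    # ... train on task i here ...
    for j in range(N_TASKS):
        M[i, j] = evaluate(i, j)
```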

2. ACC (Average Accuracy) Score
Intuitively, the ACC score is the average of the accuracies obtained on each task after training on the last task is complete: ACC = (1/T) Σ_{i=1..T} M_{T,i}, where T is the number of tasks. According to [1], the ACC score is the most important metric for evaluating forgetting in deep neural networks. The larger this metric, the better the model.

3. BWT (Backward Transfer) Score
The backward transfer score measures how the model's performance on task m is affected by subsequently training on task n (m < n): BWT = (1/(T−1)) Σ_{i=1..T−1} (M_{T,i} − M_{i,i}). The larger the value of this metric, the better. We usually observe a negative backward transfer score because of the forgetting nature of deep learning models; a positive score means the model was able to improve its performance on old tasks after training on a new one. If two models have similar ACC scores, the one with the higher BWT is usually chosen.
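Given the resultant task matrix (0-indexed here, with `M[i, j]` the accuracy on task j after training on task i), ACC and BWT can be computed in a few lines. This is a sketch of the standard definitions from [1]; the toy 3-task matrix is made up for illustration.

```python
import numpy as np

def acc_score(M):
    """Average accuracy over all tasks after training on the last task."""
    T = M.shape[0]
    return M[T - 1].mean()

def bwt_score(M):
    """Average change in accuracy on each earlier task between the moment
    it was learned (M[i, i]) and the end of training (M[T-1, i])."""
    T = M.shape[0]
    return np.mean([M[T - 1, i] - M[i, i] for i in range(T - 1)])

# Toy 3-task matrix: the diagonal is the accuracy right after learning each task.
M = np.array([[0.9, 0.1, 0.1],
              [0.6, 0.9, 0.1],
              [0.5, 0.7, 0.9]])
acc = acc_score(M)  # (0.5 + 0.7 + 0.9) / 3, approximately 0.7
bwt = bwt_score(M)  # ((0.5 - 0.9) + (0.7 - 0.9)) / 2, approximately -0.3
```

The negative BWT in the toy example reflects exactly the forgetting pattern described above: accuracies in the first column decay as later tasks are learned.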

4. CBWT (Cumulative Backward Transfer) Score
[2] proposed two further performance metrics, since the two above may be inadequate for evaluating forgetting in neural networks. CBWT is an extension of BWT: instead of examining only the last row of the task matrix, as BWT does, it measures the total amount of forgetting on each task t throughout the entire sequential learning process, where t is the current task number and T is the total number of tasks.
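The article does not reproduce the CBWT equation itself. The sketch below is one plausible reading of the verbal description, averaging the accuracy change on task t over every later training stage rather than only at the end; the exact formula in [2] may differ.

```python
import numpy as np

def cbwt_score(M, t):
    """Assumed form of CBWT for task t (0-indexed): average over all later
    training stages of the accuracy change on task t relative to M[t, t].
    This formula is an assumption based on the article's verbal description."""
    T = M.shape[0]
    return np.mean([M[i, t] - M[t, t] for i in range(t + 1, T)])

M = np.array([[0.9, 0.1, 0.1],
              [0.6, 0.9, 0.1],
              [0.5, 0.7, 0.9]])
c0 = cbwt_score(M, 0)  # ((0.6 - 0.9) + (0.5 - 0.9)) / 2, approximately -0.35
```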

5. TBWT (True Backward Transfer) Score
The other evaluation metric proposed by [2] compares the accuracy on each task (after training on all tasks is complete) with that of an independent classifier G. TBWT is similar to BWT, but the degradation of the model on each task is measured against a gold standard, i.e., a classifier trained at full capacity on that task alone, where G_{i,i} is the accuracy of an independent classifier trained on task i. For our experiment, the independent classifier is a random forest (with 64 estimators, usually reaching about 90% accuracy on each MNIST permutation).
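Again, the equation itself is not reproduced in the article. Assuming TBWT mirrors BWT with the independent-classifier accuracies G_ii in place of the model's own diagonal entries M_ii (an assumption based on the description above), it could be computed as:

```python
import numpy as np

def tbwt_score(M, G_diag):
    """Assumed form of TBWT: like BWT, but each task's final accuracy is
    compared against an independently trained classifier's accuracy G_ii
    instead of the model's own just-after-training accuracy M[i, i]."""
    T = M.shape[0]
    return np.mean([M[T - 1, i] - G_diag[i] for i in range(T - 1)])

M = np.array([[0.9, 0.1, 0.1],
              [0.6, 0.9, 0.1],
              [0.5, 0.7, 0.9]])
G_diag = [0.9, 0.9, 0.9]  # e.g. per-task random-forest accuracy (~90% in the article)
t = tbwt_score(M, G_diag)  # ((0.5 - 0.9) + (0.7 - 0.9)) / 2, approximately -0.3
```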

Optimal Training Model

In this section, we compare various hyper-parameters, loss functions, and optimizers to find the neural network configuration that best avoids catastrophic forgetting.

  1. Comparison of Learning Rates

It was noticed that a very high learning rate usually yields a very high accuracy on the current task during training; however, the network's forgetting of previous tasks is then more catastrophic. Hence, we need to find an optimal learning rate such that forgetting is not catastrophic and the testing accuracy on the current task is not too low. As seen in Figure 2 (showing only task 1 over 230 epochs), when the learning rate is too high, the classification accuracy drops as low as 30%. When the learning rate is too low, the network hardly forgets, but we do not reach optimal accuracy on the current task (the accuracy on the first task is only 70% after 50 epochs). Hence, the learning rate was set to 0.0002, which empirical analysis showed to be the optimal value.

2. Comparison of Optimizers

Three optimizers (Adam, RMSProp, and SGD) were compared, each with a learning rate of 0.0002, to see which helps the network avoid catastrophic forgetting. Figure 3 compares the validation accuracy of the optimizers on the first task (over 230 epochs). As shown, Adam and RMSProp are far better for this purpose than SGD, whose accuracy declines steeply. However, the ACC and BWT scores were more or less the same when training with Adam or RMSProp. Adam is used as the optimizer for the rest of the experiments.

3. Comparison of Loss Functions

Since this is a classification task, categorical cross-entropy with a softmax output and one-hot encoded labels is used. However, instead of the vanilla loss function alone, it is combined with various regularizers (L1, L2, L1 + L2) for performance comparison. After some empirical analysis, we use a very small regularization coefficient beta = 10^-7. The results are depicted in Figure 4: cross-entropy combined with L1 regularization helps avoid catastrophic forgetting on some tasks, whereas L2 and hybrid regularization performed worse than vanilla cross-entropy. The ACC and BWT scores for the different loss functions (after training on all tasks) are given in Table 1.
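As a sketch, the combined objective looks like the following. This uses NumPy rather than the article's TensorFlow 1.14 code, and the logits, labels, and weight values are made up for illustration; only the structure (cross-entropy plus a beta-weighted L1 penalty) follows the article.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def l1_cross_entropy(logits, y_onehot, weights, beta=1e-7):
    """Categorical cross-entropy plus an L1 penalty on all weight matrices."""
    p = softmax(logits)
    ce = -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))
    l1 = sum(np.abs(w).sum() for w in weights)
    return ce + beta * l1

# Illustrative single example with 3 classes and one tiny weight matrix.
logits = np.array([[2.0, 0.5, 0.1]])
y = np.array([[1.0, 0.0, 0.0]])
w = [np.array([[0.5, -0.5]])]
loss = l1_cross_entropy(logits, y, w)
```

With beta this small, the L1 term only nudges weights toward sparsity rather than dominating the classification loss, which matches the article's observation that heavier regularization hurt performance.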

4. Comparison of Depth of Neural Network

It was observed that deeper neural networks are more prone to catastrophic forgetting. As seen in Figure 5 and Table 2, the 2-layer neural network has the best ACC and BWT scores (mean of 3 runs). This can be explained by the fact that changing the weights of a deeper MLP may cause larger changes in the final output. The ACC score degraded from 0.62 to 0.54 when switching from a 2-layer to a 4-layer MLP. As seen in Figure 5, the accuracy decline on many tasks is much steeper for the 4-layer network, whereas the 2-layer network declines smoothly.

Dropout for 4-layered network

A dropout layer was added after each layer, with different dropout probabilities. In [1], it is mentioned that adding dropout helps neural networks mitigate catastrophic forgetting. However, it was observed that a higher probability does not always lead to better performance. In fact, performance improved up to a dropout probability of 0.1, after which it degraded for higher probabilities. In Figure 6, we can see that with a higher probability, most tasks do not reach optimal accuracy, which degrades the ACC score; the ACC and BWT scores decline steeply beyond a dropout probability of 0.2. Hence, we stick to a dropout probability of 0.1 for further evaluation.

Best Performance Metrics

A 2-layer neural network with the Adam optimizer, an L1 regularizer, 0.1 dropout, and a 0.0002 learning rate is used as the optimal model for getting the best performance. The resultant task matrix for this network is shown in Table 4, and the validation accuracy for each task in Figure 7. As one can see, most task accuracies remain above 60% after training on all 10 tasks. Next, we evaluate the performance metrics based on the resultant task matrix described previously.

ACC Score: 0.6535

BWT Score: -0.27

TBWT Score (Using Random Forest): -0.291