Original article was published on Deep Learning on Medium
For the saliency modulation model, we use a two-branch structure: an RGB branch, which takes the original image as input (with shape (H,W,C)), and a saliency branch, which takes the corresponding pre-computed saliency map as input (with shape (H,W,1)). The model outputs a predicted class label for each image and is trained against the ground-truth labels. We introduce the architecture step by step.
- Input Images: As described in the previous section, two datasets are used: the source dataset ImageNet, on which the TensorFlow pre-trained ResNet50 was trained, and the target dataset PubFig, which is used for training the target model. In addition, ImageNet images with their corresponding saliency maps are used for pre-training the Saliency Modulation Model. Likewise, PubFig images with their corresponding saliency maps are used to train the Saliency Modulation Model for the main image classification task.
- RGB Branch: the RGB branch has the same structure as the baseline transfer model. It takes the 3-channel original image as input. The structure is shown below:
- Saliency Branch: the saliency branch has almost the same structure as the RGB branch, so that its spatial dimensions match those of the RGB branch at the fusion stage. However, two design choices are worth noting: (1) the input of the saliency branch is the 1-channel saliency map; (2) unlike the RGB branch, which uses the ReLU non-linearity, the saliency branch ends with a sigmoid activation. This ensures that the output of the saliency branch lies within [0, 1], which is a suitable range for feature modulation. The structure is shown below:
- Modulation: After several experiments, we chose to combine the two branches by modulation (the x symbol, which stands for the element-wise product) after stage 1 of the ResNet but before the max-pooling layer. Fusing before max-pooling lets the model make full use of the saliency information at a higher resolution. In addition, we’ve borrowed the idea of a ‘skip connection’, which prevents the model from completely ignoring the features from the RGB branch. As a further refinement, a parameter can be assigned to control the importance of this skip connection. The structure is shown below:
- Weight Initialization: there are two weight initialization methods that we’ve tried:
(1) Use the pre-trained ResNet50 weights for the RGB branch (without the fully connected layers) and Xavier uniform initialization for the rest. Details are shown in Figure 4. We will refer to this as ‘half pre-trained’ in the following sections.
(2) Pre-train the whole 2-branch structure on ImageNet, and then use the resulting weights to initialize all the weights of our network except those of the fully connected layers. Details are shown in Figure 5. We will refer to this as ‘fully pre-trained’ in the following sections.
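The fusion described in the Modulation step can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the project's actual code: the function name `modulate`, the `skip_weight` parameter, and the additive form of the skip connection are our own choices, and the feature shape (112, 112, 64) assumes a 224x224 input to the first stage of ResNet50. Random arrays stand in for the real branch outputs.

```python
import numpy as np

def sigmoid(x):
    """Squash values into (0, 1), as done at the end of the saliency branch."""
    return 1.0 / (1.0 + np.exp(-x))

def modulate(rgb_features, saliency_logits, skip_weight=1.0):
    """Gate RGB features with saliency and add a weighted skip connection.

    rgb_features:    (H, W, C) features from the RGB branch (after ReLU)
    saliency_logits: (H, W, 1) or (H, W, C) pre-sigmoid saliency branch output
    skip_weight:     assumed scalar controlling the skip connection strength
    """
    gate = sigmoid(saliency_logits)        # modulation values in (0, 1)
    modulated = rgb_features * gate        # element-wise product (broadcasts over channels)
    return modulated + skip_weight * rgb_features

# Stand-in tensors; real inputs would come from the two branches.
rng = np.random.default_rng(0)
rgb_features = np.maximum(rng.normal(size=(112, 112, 64)), 0.0)  # ReLU-like, non-negative
saliency_logits = rng.normal(size=(112, 112, 1))

fused = modulate(rgb_features, saliency_logits, skip_weight=1.0)
print(fused.shape)
```

With `skip_weight=0.0` the RGB features can only be attenuated by the gate; a positive `skip_weight` guarantees that some of the original RGB signal always reaches the later layers, which is the role the skip connection plays in the description above.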
For the PubFig dataset, we have 2218 images for training, 555 for validation and 309 for testing. We mainly compare the performance of three models: the baseline transfer model, the saliency modulation model (half pre-trained) and the saliency modulation model (fully pre-trained). For all models, we use cross-validation and fine-tune for 50 epochs. We use SGD as the optimizer, with a learning rate of 0.0001 and a momentum of 0.9. All layers that have no pre-trained weights are initialized with Xavier initialization.
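To make the training setup concrete, here is a minimal NumPy sketch of Xavier (Glorot) uniform initialization and a single SGD-with-momentum update using the learning rate and momentum quoted above. It is an illustration only: `NUM_CLASSES` is a hypothetical placeholder (the number of classes is not stated here), the toy loss is ours, and the update uses the classical momentum formulation, which may differ in detail from what a given framework implements.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def sgd_momentum_step(w, grad, velocity, lr=1e-4, momentum=0.9):
    """One classical momentum update: v <- momentum * v - lr * grad; w <- w + v."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

FEATURE_DIM = 2048   # size of ResNet50's pooled feature vector
NUM_CLASSES = 10     # hypothetical placeholder, not taken from the article

rng = np.random.default_rng(0)
w = xavier_uniform(FEATURE_DIM, NUM_CLASSES, rng)   # layer without pre-trained weights
velocity = np.zeros_like(w)

# Toy loss 0.5 * sum(w ** 2), whose gradient is simply w.
grad = w.copy()
new_w, velocity = sgd_momentum_step(w, grad, velocity)
```

With a learning rate as small as 0.0001, each step moves the weights only slightly, which is the usual intent when fine-tuning from pre-trained weights.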
First, let’s take a look at the performance of the baseline transfer model and the saliency modulation model with pre-trained weights only for the RGB branch.
From the plot and the test accuracy statistics above, we can see that the two models seem to converge to equally good solutions. However, we can compare the convergence speed of the two models by zooming in on the first 10 epochs of their learning curves, which gives the plot below:
From the plot above we can see that the saliency modulation model converges faster than the baseline transfer model.
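One simple way to put a number on "converges faster" is to count how many epochs each model needs before its validation accuracy first reaches a chosen threshold. The helper below is a hypothetical sketch of that idea; the function name is ours, and the two curves are toy placeholders that are not the measured results of these experiments.

```python
def epochs_to_reach(val_accuracy, threshold):
    """Return the 1-based epoch at which accuracy first reaches threshold, or None."""
    for epoch, acc in enumerate(val_accuracy, start=1):
        if acc >= threshold:
            return epoch
    return None

# Toy placeholder curves for illustration only (NOT the experimental results).
toy_curve_a = [0.20, 0.45, 0.70, 0.82, 0.88]
toy_curve_b = [0.10, 0.25, 0.45, 0.65, 0.80]

print(epochs_to_reach(toy_curve_a, 0.80))
print(epochs_to_reach(toy_curve_b, 0.80))
```

Applying the same threshold to the recorded per-epoch accuracies of both models would give a single comparable figure for each learning curve.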
Let’s then compare the performance of the half pre-trained and the fully pre-trained models.
It’s very surprising to see that the test performance has decreased dramatically and that overfitting has become more severe, which contradicts earlier findings as well as our intuition. After careful inspection, we believe the likely reason is that we could not obtain a decent pre-trained 2-branch model within our limited time and resources. The pre-trained model itself already overfits severely, and therefore cannot serve as a good initialization; it may even hurt the performance of the target model.
To be more specific, the ImageNet data (original ImageNet images with their corresponding saliency maps) that we were able to construct for pre-training has 8828 training, 2057 validation and 1143 test images. By contrast, the half pre-trained model uses weights trained on the whole ImageNet training set, which contains more than a million images. The best pre-trained model we obtained has 0.9983 accuracy on the training set but less than 0.60 accuracy on the test set, which already indicates strong overfitting. However, during our experiments we found that when we increase the amount of data available to the pre-trained model, the performance of the pre-trained model improves, and the performance of our target model improves as well. Therefore, it seems very promising that, given enough resources, the performance of the target model can be improved substantially and may even surpass the baseline transfer model in both speed and accuracy.
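For a sense of scale, the short sketch below works out the total size and the train/validation/test proportions of the two datasets from the counts quoted in this article. The helper name `split_fractions` is ours; only the image counts come from the text.

```python
def split_fractions(train, val, test):
    """Return the total image count and the (train, val, test) fractions, rounded to 3 places."""
    total = train + val + test
    return total, tuple(round(n / total, 3) for n in (train, val, test))

# Counts quoted in the article.
pubfig_total, pubfig_frac = split_fractions(2218, 555, 309)
imagenet_total, imagenet_frac = split_fractions(8828, 2057, 1143)

print(pubfig_total, pubfig_frac)
print(imagenet_total, imagenet_frac)
```

Both splits are roughly 72-73% training data, and the ImageNet subset used for pre-training the 2-branch model holds about twelve thousand images in total, which is tiny compared with the full ImageNet training set behind the ResNet50 weights.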
In conclusion, we’ve found that when both models are initialized only with the ResNet50 weights trained on ImageNet, the saliency modulation model learns considerably faster than the baseline, while the improvement in the converged result is subtle. In addition, based on the experiments we’ve done so far, we believe that if the saliency modulation model can be properly fully pre-trained, it will outperform the baseline transfer model not only in speed, but also in accuracy.
Conclusion & Future Work
In this project, we studied how saliency maps can directly help to improve model performance. We explored modeling approaches that take saliency maps as model input and applied the *delayed fusion* technique to integrate saliency information into a two-branch model structure. In addition, we used transfer learning to transfer knowledge from models pre-trained on large datasets to training on scarce datasets.
As the experiments show, the Saliency Modulation Model trains faster. The intuition behind this is straightforward: saliency maps generated from the pre-trained model contain “knowledge” for distinguishing objects from the background, and when we fuse this saliency information into the model, the model can quickly locate the most representative area of the object and thus learn useful features more efficiently. Due to time and resource constraints on pre-training the saliency modulation model, the target model accuracy is lower than expected. However, we can see that as we enlarge the dataset for model pre-training, the target model test accuracy increases. This suggests that the current pre-trained saliency modulation model is overfitted. In other words, if we pre-train the saliency modulation model on a larger source dataset, the target model should achieve better results. In theory, we expect the target saliency modulation model to reach better accuracy in fewer epochs than the baseline model.
In addition, the source data does not contain object classes related to humans. Therefore, the current saliency maps may not provide optimal saliency information for detecting human facial features. Although source classes unrelated to human faces still produce reasonably good saliency maps, it would be even better to pre-train the source model with human facial image classes, which would however require substantial computational resources and training time.