Using soft attention saliency maps to interpret vision neural net predictions

While reading the Nature article on predicting cardiovascular risk factors by training deep neural nets on retinal images, one thing stood out: the use of soft attention saliency maps to interpret the features built by the neural nets.

Details on the paper:

The assumption here is that the saliency map generated by a simpler network is similar to the features a more complex, bigger network uses to generate its predictions. I am a little uncomfortable with this assumption; nonetheless, it's a good tool for validating the features extracted by neural networks, and it can catch other training issues such as data leakage. Thoughts?

Mapping attention: To better understand how the neural-network models arrived at the predictions, we used a deep-learning technique called soft attention.

Briefly, we used the following architecture: the input images were 587 × 587 pixels and the saliency map was originally 73 × 73 pixels. There were 3 (2 × 2) maxpool layers and 4 (3 × 3) convolutional layers before the saliency map, as well as 3 reverse maxpool layers to upscale the saliency map from 73 × 73 pixels back to 587 × 587 pixels. The convolutional layers contained 64, 128, 256 and 512 filters. The path from input image to saliency map is described in Supplementary Table 1. The first two dimensions are image size. The third is number of filters (or channels in the case of the input image).
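To sanity-check the sizes quoted above, here is a minimal sketch that traces the spatial size through the three 2 × 2 maxpool layers. It assumes the pools use stride 2 with no padding and that the 3 × 3 convolutions use "same" padding, so they leave the spatial size unchanged; neither detail is stated in the excerpt, and the function names are my own.

```python
def pooled_size(size, window=2, stride=2):
    """Spatial size after a max-pool with no padding (sizes round down)."""
    return (size - window) // stride + 1


def trace_encoder(size=587, n_pools=3):
    """Sizes before the first pool and after each pool.

    Assumption: the 3x3 convolutions are 'same'-padded, so only the
    pooling layers change the spatial size.
    """
    sizes = [size]
    for _ in range(n_pools):
        sizes.append(pooled_size(sizes[-1]))
    return sizes


sizes = trace_encoder()

# Plain 2x upsampling applied three times does not get back to 587,
# because two of the pools rounded an odd size down.
naive_upscaled = sizes[-1] * 2 ** 3

# A reverse maxpool that restores each recorded pre-pool size does.
restored_sizes = list(reversed(sizes[:-1]))

print(sizes)           # [587, 293, 146, 73]
print(naive_upscaled)  # 584
print(restored_sizes)  # [146, 293, 587]
```

Under these assumptions the 73 × 73 figure checks out. The trace also suggests why the paper describes "reverse maxpool" layers rather than plain 2× upsampling: simply doubling three times would give 584 × 584, so each upscaling step presumably has to restore the size recorded before the matching pool.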

These small models are less powerful than Inception-v3. They were used only for generating attention heat maps and not for the best performance results observed with Inception-v3. For each prediction shown in Fig. 2, a separate model with identical architecture was trained. The models were trained on the same training data as the Inception-v3 network described above, and the same early stopping criteria were used.

To provide a qualitative assessment of the features that are highlighted in the heat maps, we generated 100 images for each of the predicted factors from 3 image sets for a total of 700 images. For the BMI, current smoker, SBP and DBP predictions, we randomly sampled 100 images for each of these predictions from the UK Biobank dataset. For HbA1c, we randomly sampled 100 images from the EyePACS dataset. For age and gender, we randomly sampled 50 images from the EyePACS dataset and 50 from the UK Biobank dataset. The 700 images were shown to three ophthalmologists in the same (randomized) order using a survey form (see Supplementary Fig. 1 for a screenshot of the form) for a total of 300 responses per prediction. On the basis of feedback from the ophthalmologists, we aggregated their responses so that veins, arteries, arterioles, venules and vessel surroundings were reported as ‘vessels’, optic disc and optic-disc edges were reported as ‘optic disc’ and image edges and ‘nothing in particular’ were reported as ‘non-specific features’. The macula was not one of the checkbox options, but ophthalmologists repeatedly reported highlighting of the macula for the gender predictions.
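The aggregation step described above can be sketched as a simple lookup plus a tally. This is only an illustration: the option strings are my guess at the checkbox labels (the actual survey form is in Supplementary Fig. 1), the example responses are made up, and counting each aggregated category at most once per response is my assumption.

```python
from collections import Counter

# Hypothetical checkbox labels mapped to the aggregated categories
# named in the paper.
AGGREGATE = {
    "veins": "vessels",
    "arteries": "vessels",
    "arterioles": "vessels",
    "venules": "vessels",
    "vessel surroundings": "vessels",
    "optic disc": "optic disc",
    "optic-disc edges": "optic disc",
    "image edges": "non-specific features",
    "nothing in particular": "non-specific features",
}


def aggregate(responses):
    """Tally aggregated categories over survey responses.

    `responses` holds one list of checked options per (image, grader)
    pair. Each aggregated category counts at most once per response.
    Options not in the lookup are kept under their own name.
    """
    counts = Counter()
    for checked in responses:
        counts.update({AGGREGATE.get(option, option) for option in checked})
    return counts


# Made-up responses, for illustration only.
example = [
    ["veins", "arteries"],
    ["optic-disc edges"],
    ["image edges", "veins"],
]
print(aggregate(example))

# 100 images per prediction, each shown to 3 ophthalmologists.
responses_per_prediction = 100 * 3
```

With 100 images per prediction and three graders, each prediction gets 300 responses, which matches the total quoted in the paper.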

Source: Deep Learning on Medium