Enhancing the power of Softmax for image classification


Transforming the SoftMax Layer

The idea behind the transformation is to distribute all features on a hypersphere and add an angular margin between the features of each class, so as to enhance their discriminative power (Liu, et al., 2017).

Initially, let's recall the SoftMax loss:

L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_i + b_{j}}}

where x_i is the feature of the i-th sample, belonging to the y_i-th class, W_j is the j-th column of the weight matrix W, b_j is the bias term, N is the batch size, and n is the number of classes.

We can rewrite the logit

W_{j}^{T} x_i + b_{j}

as the inner product of the first term plus the second:

W_{j}^{T} x_i + b_{j} = \|W_{j}\| \|x_i\| \cos\theta_{j} + b_{j}

where θ_j is the angle between the weight W_j and the feature x_i. Furthermore, if we normalize the weights and zero the biases, i.e. fix \|W_j\| = 1 and b_j = 0, notice that the final prediction for x_i depends only on the angle θ (Liu, et al., 2017). Therefore the SoftMax loss is transformed to

L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\|x_i\| \cos\theta_{y_i}}}{\sum_{j=1}^{n} e^{\|x_i\| \cos\theta_{j}}}

Another modification we should make is to normalize the feature vectors as well, to remove variations in the radial direction. As a result, the features lie on the surface of a hypersphere of radius s, and learning depends only on the cosine values to develop the discriminative power (Wang, et al., 2018):

L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos\theta_{y_i}}}{\sum_{j=1}^{n} e^{s \cos\theta_{j}}}
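As a minimal illustration of this step (the shapes, names, and the value of s below are assumptions for the sketch, not the article's code), normalizing both the features and the weight columns reduces every logit to s·cos θ:

```python
import tensorflow as tf

s = 30.0  # assumed hypersphere radius

features = tf.random.normal([8, 512])   # illustrative batch of embeddings
weights = tf.random.normal([512, 10])   # one column per class

# L2-normalize the features per row and the weights per class column,
# so their inner product is exactly cos(theta).
x = tf.math.l2_normalize(features, axis=1)
w = tf.math.l2_normalize(weights, axis=0)

logits = s * tf.matmul(x, w)  # each entry equals s * cos(theta_j)
```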

This modified SoftMax can learn features separated by an angular boundary, but this does not mean it is strong enough to satisfy our initial hypothesis. Thus we will add an additive angular margin penalty m between the weight and the feature to simultaneously enhance the intra-class compactness and the inter-class discrepancy. Many angular margin penalties have been proposed, but the additive one has a better geometric attribute than the others, with an exact correspondence to the geodesic distance on the hypersphere manifold (Deng, Guo, Xue, & Zafeiriou, 2018). The final transformed SoftMax loss with the added margin is shown below:

L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cos\theta_{j}}}
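To make the effect of the margin concrete, here is a small worked example (the numbers are purely illustrative). With m = 0.5, a sample at angle θ = 0.8 rad from its class weight receives the target logit

s \cos(0.8 + 0.5) \approx 0.267\, s \quad \text{instead of} \quad s \cos(0.8) \approx 0.697\, s

so to restore the same logit the network has to pull θ down to about 0.3 rad, i.e. move the feature closer to its class centre. This is exactly the intra-class compactness the margin is meant to enforce.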

The whole process of the modification is presented in Figure 3.

Figure 3: Visualization of the whole SoftMax modification

The features of the test set, computed at the last fully-connected layer, are visualized below in 3D Euclidean space using PCA.
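A minimal sketch of such a visualization might look like the following (the stand-in data, scikit-learn's PCA, and matplotlib are my assumptions, not necessarily the article's tooling):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-in data: in the article's setting these would be the (N, d)
# test-set embeddings from the last fully-connected layer and their class ids.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 64))
labels = rng.integers(0, 10, size=500)

# Project the d-dimensional features down to 3D for plotting.
points = PCA(n_components=3).fit_transform(embeddings)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=labels, s=4)
plt.show()
```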

As we can observe, the features of each class are very compact and well separated from the other classes, thanks to the additive angular margin m (0.5 above). Therefore, this method is suitable for datasets with thousands of classes whose features are not highly discriminative.

Below is the code for the modified SoftMax layer in tf.keras, used for training with two inputs (features, correct_labels):
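A minimal sketch of such a layer might look like the following (the class name ArcMarginLayer and the defaults s = 30, m = 0.5 are illustrative assumptions, not necessarily the original implementation):

```python
import tensorflow as tf


class ArcMarginLayer(tf.keras.layers.Layer):
    """Modified SoftMax layer with an additive angular margin.

    Takes two inputs, (features, correct_labels), where the labels are
    one-hot encoded, and returns the margin-adjusted logits scaled by s.
    A regular softmax cross-entropy (from_logits=True) goes on top.
    """

    def __init__(self, num_classes, s=30.0, m=0.5, **kwargs):
        super().__init__(**kwargs)
        self.num_classes = num_classes
        self.s = s  # radius of the hypersphere
        self.m = m  # additive angular margin

    def build(self, input_shape):
        feature_shape, _ = input_shape
        self.w = self.add_weight(
            name="w",
            shape=(feature_shape[-1], self.num_classes),
            initializer="glorot_uniform",
            trainable=True,
        )

    def call(self, inputs):
        features, labels = inputs
        # Normalize features and weights so the inner product is cos(theta).
        x = tf.math.l2_normalize(features, axis=1)
        w = tf.math.l2_normalize(self.w, axis=0)
        cos_t = tf.matmul(x, w)
        # Clip before acos for numerical safety at the boundaries.
        cos_t = tf.clip_by_value(cos_t, -1.0 + 1e-7, 1.0 - 1e-7)
        theta = tf.acos(cos_t)
        # Add the margin m only to the angle of the correct class.
        cos_t_margin = tf.cos(theta + self.m)
        logits = labels * cos_t_margin + (1.0 - labels) * cos_t
        return self.s * logits
```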

The complete code can be found on my GitHub. You can experiment with different values of m to see their impact on the final visualization.
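For instance, a model could be wired up with the ArcMarginLayer sketched above as follows (the dimensions and hyperparameters are placeholders); changing m here is all that is needed to rerun the experiment:

```python
import tensorflow as tf

num_classes = 10
embedding_dim = 512

features_in = tf.keras.Input(shape=(embedding_dim,), name="features")
labels_in = tf.keras.Input(shape=(num_classes,), name="correct_labels")

# Try m = 0.2, 0.5, 0.8, ... and compare the resulting feature plots.
logits = ArcMarginLayer(num_classes, s=30.0, m=0.5)([features_in, labels_in])

model = tf.keras.Model([features_in, labels_in], logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
)
```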