Source: Deep Learning on Medium
Transforming SoftMax Layer
The idea behind the transformation is to distribute all features on a hypersphere and add an angular margin between the features of each class, so as to enhance their discriminative power (Liu, et al., 2017).
Initially, let's recall the SoftMax loss function:

L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T} x_i + b_j}}
We can rewrite the logit as an inner product:

W_{y_i}^{T} x_i + b_{y_i} = \lVert W_{y_i} \rVert \, \lVert x_i \rVert \cos\theta_{y_i} + b_{y_i}

where θ is the angle between the weight and the feature. Furthermore, if we normalize the weights (‖Wⱼ‖ = 1) and zero the biases (bⱼ = 0), the logit reduces to ‖xᵢ‖ cos θ: notice that the final prediction for xᵢ depends only on the angle θ (Liu, et al., 2017). Therefore the SoftMax loss is transformed to

L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\lVert x_i \rVert \cos\theta_{y_i}}}{\sum_{j=1}^{n} e^{\lVert x_i \rVert \cos\theta_j}}
Another modification we should make is normalizing the feature vectors as well, to remove variations in radial directions. As a result, the features lie on the surface of a hypersphere of radius s, and learning depends only on cosine values to develop the discriminative power (Wang, et al., 2018):

L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s \cos\theta_{y_i}}}{\sum_{j=1}^{n} e^{s \cos\theta_j}}
This modified SoftMax can learn features separated by an angular boundary, but this alone is not strong enough to satisfy our initial hypothesis. Thus we will add an additive angular margin penalty between the weight and the feature, to simultaneously enhance the intra-class compactness and the inter-class discrepancy. Many angular margin penalties have been proposed, but the additive one has a better geometric attribute than the others, with an exact correspondence to the geodesic distance on the hypersphere manifold (Deng, Guo, Xue, & Zafeiriou, 2018). The final transformed SoftMax loss with the added margin is shown below:

L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cos\theta_j}}
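Numerically, the margin makes the target class strictly harder to satisfy: for any angle θ in (0, π − m), cos(θ + m) < cos(θ), so the network must shrink θ (pull the feature toward its class weight) to compensate. A quick illustration in plain NumPy:

```python
import numpy as np

m = 0.5  # additive angular margin, in radians
theta = np.linspace(0.1, np.pi - m - 0.1, 5)  # sample target angles

plain = np.cos(theta)            # target logit of the modified SoftMax
with_margin = np.cos(theta + m)  # target logit once the margin is added

# the margin always lowers the target-class logit, which is what
# forces the intra-class angles to become smaller during training
assert np.all(with_margin < plain)
```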
In figure 3 the whole process of the modification is presented
The features of the last fully-connected layer, computed on the test set, are visualized below in 3D Euclidean space using PCA.
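The projection step can be sketched as follows; this is a minimal version using NumPy's SVD rather than any particular library, so the function name and shapes are illustrative assumptions:

```python
import numpy as np

def pca_project(features, n_components=3):
    """Project feature vectors onto their top principal components."""
    # center the features
    centered = features - features.mean(axis=0)
    # principal axes are the rows of Vt, sorted by singular value
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T

# e.g. 512-D embeddings for 100 test samples -> 100 x 3 points to plot
embeddings = np.random.randn(100, 512)
points_3d = pca_project(embeddings)
print(points_3d.shape)  # (100, 3)
```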
As we can observe, the features of each class are very compact and separable from the other classes, thanks to the additive angular margin m (0.5 above). Therefore, this method is suitable for datasets with thousands of classes whose features are not highly discriminative.
Below is the code for the modified tf.keras SoftMax layer, which takes two inputs during training (features, correct_labels):
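A minimal sketch of such a layer, assuming the labels input is one-hot encoded and using the common defaults s = 64, m = 0.5 (the class name and exact argument names are illustrative; see the full code in the repo):

```python
import tensorflow as tf

class ArcFace(tf.keras.layers.Layer):
    """Additive angular margin SoftMax layer (sketch).

    Takes (features, one_hot_labels) and returns class probabilities
    computed from the margin-adjusted, scaled cosine logits.
    """
    def __init__(self, n_classes, s=64.0, m=0.5, **kwargs):
        super().__init__(**kwargs)
        self.n_classes = n_classes
        self.s = s  # hypersphere radius (feature scale)
        self.m = m  # additive angular margin, in radians

    def build(self, input_shape):
        feat_shape, _ = input_shape
        self.W = self.add_weight(name="W",
                                 shape=(int(feat_shape[-1]), self.n_classes),
                                 initializer="glorot_uniform",
                                 trainable=True)

    def call(self, inputs):
        x, y = inputs  # features, one-hot labels
        # normalize features and weights so each logit equals cos(theta)
        x = tf.nn.l2_normalize(x, axis=1)
        W = tf.nn.l2_normalize(self.W, axis=0)
        cos_t = tf.matmul(x, W)
        # add the margin m only to the target-class angle
        theta = tf.acos(tf.clip_by_value(cos_t, -1.0 + 1e-7, 1.0 - 1e-7))
        target_logits = tf.cos(theta + self.m)
        logits = cos_t * (1.0 - y) + target_logits * y
        return tf.nn.softmax(self.s * logits)
```

During training the labels steer the margin onto the correct class; at inference time the plain normalized features (or the cosine logits without the margin) are used as embeddings.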
The complete code can be found in my GitHub repo (linked below). You can experiment with different values of m to see their impact on the final visualization.