Original article was published on Deep Learning on Medium
- Properties of the Mish activation function:
— Non-monotonicity: Unlike ReLU, Mish preserves small negative values, which stabilizes gradient flow, largely avoids the dying-ReLU problem, and helps the network learn more expressive features.
— Unbounded above and bounded below: The former avoids saturation of the output neurons; the latter acts as a form of regularization for the network.
— Infinite order of continuity: Mish is a smooth function, which makes training less sensitive to weight initialization and learning rate, helping generalization.
— Higher compute cost, but better accuracy: Although Mish is more expensive to evaluate than ReLU, it has proven more effective in deep networks.
— Scalar gating: Mish applies a scalar gate to its input, so it can serve as a drop-in replacement for pointwise activations like ReLU.
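For reference, Mish is defined as f(x) = x · tanh(softplus(x)). A minimal NumPy sketch of the function (illustrative only, not tied to any particular framework):

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x))
    return np.logaddexp(0.0, x)

def mish(x):
    # Mish(x) = x * tanh(softplus(x))
    # Scalar gating: the input x is multiplied by a smooth gate in (-1, 1).
    return x * np.tanh(softplus(x))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(mish(x))
```

Note how negative inputs yield small negative outputs (bounded below) while large positive inputs pass through almost unchanged (unbounded above), matching the properties listed above.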
Using Mish as the activation function in YOLOv4 produced a decent accuracy gain: the Mish + CSPDarknet53 combination gave the best results in the paper's ablation study. It added some computational cost but noticeably refined detector accuracy, which is exactly the paper's definition of a Bag of Specials.