Source: Deep Learning on Medium
Implementing the New State of the Art Mish Activation With 2 Lines of Code In Pytorch
State of the art deep learning never felt so easy
A recent paper by Diganta Misra introduces a new activation function for deep learning called the mish activation. This new activation function beat both the ReLU and swish activation functions when tested on CIFAR-100 with the Squeeze Excite Net-18. I highly recommend reading the paper if you want the details about the research and the activation function. I am not going to go into depth on the math and research of the paper, but the function itself is simple: mish(x) = x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x).
If you are familiar with activation functions, you might be thinking that it looks a whole lot like the swish activation. That is because mish was inspired by swish. From an initial read of the paper, it seems like mish could potentially be better than both the swish and the extremely popular ReLU activations. The best part about this brand-new activation function is that you can implement it with 2 lines of code.
First, I am going to show you how to implement mish in a neural network that you build yourself. Before we build our network, we need to write the mish function using PyTorch. As promised, it only requires 2 lines of code.
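A minimal sketch of the function, using PyTorch's built-in `softplus` (the two lines are the `def` and the `return`; the imports are just for completeness):

```python
import torch
import torch.nn.functional as F

def mish(x):
    # mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
    return x * torch.tanh(F.softplus(x))
```

Because it is built from element-wise PyTorch ops, autograd handles the backward pass for free.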
And with those two lines of code, we wrote a state of the art activation function. So now let's write a basic CNN and work our activation function into it.
In the forward method, I set the activation function for the linear layers to the mish activation we wrote above. Now the model is ready to train. It is pretty straightforward and easy to implement! I ran this model on the aerial cactus identification data on Kaggle and saw a 1.5% increase in accuracy over a ReLU activation after 10 training epochs. I won't complain about that.
Building your own neural network is cool, but hardly ever practical. Transfer learning tends to be much more effective when it comes to obtaining top level results in deep learning. So let's look at how we could implement the mish activation with VGG-16. Instead of a function, we need to write a class for our mish activation that inherits from the torch.nn.Module class. It should look something like this:
I apologize that I promised 2 lines of code and now I am changing it to 5. I hope that after you use this and see how cool it is, you will forgive me. With our activation function written as a class, we can now add it to our VGG-16 model.
We change the ReLU activations in the classification part of VGG-16 to mish activations and replace the last layer with one sized for our classification problem. After freezing the gradients for all but the final layer, we are ready to train! And with those little pieces of code, you have implemented a state of the art activation function.
A Few Tips
There are a few things mentioned in the mish paper that I think are worth noting if you want to implement it:
- The mish function performs better at a lower learning rate than what you might use for other activation functions, so make sure you don't go too high.
- Mish activation seems fairly robust to changes in hyper-parameters. Check out the paper for more on this.
- There is no best activation function. This will not always be better than everything else. It is definitely worth a try though.
If you are interested in seeing my full code, I put it on Kaggle here. Feel free to clone it and try different architectures or to apply mish activation to the convolutional layers too.