Original article was published by Cynthia Masetto on Artificial Intelligence on Medium
Baking the perfect ML model
“…The application of deep neural networks remains a black art. Often requiring years of experience to effectively choose optimal hyperparameters. Currently the process of setting the hyperparameters requires expertise and extensive trial and error and is based more on serendipity than science… “ — Leslie N. Smith
When you mention “machine learning hyperparameter optimisation”, people tend to glaze over and assume that you’re talking about something super complicated. However, whilst hyperparameter optimisation might sound scary, it really isn’t all that complicated. You can think of it as a bit like tweaking the recipe of your cake (or ML model) to make it tastier (more accurate). When you buy a cake from a store (use a pre-trained model or API), you may not know exactly what’s inside and you have to trust that the bakers have used good quality ingredients and that you’ll like the flavour of the cake (that the use case they built their model for is the same or very close to yours). Here’s the thing though, you don’t have to be a master chef to bake your own ML models. You can use tips and tricks to tweak the recipe and bake something even tastier than the store-bought option.
One of the main problems with hyper-parameter optimisation is that it usually takes a lot of time and energy. You might end up cooking hundreds of cakes before you perfect your recipe. The saying ‘time is money’ is especially applicable when you might be consuming large clusters of cloud compute resources as part of your ‘baking’ process. In private R&D and to some extent in academia, there is a current trend of researchers just caring about a model’s statistical performance and being happy to throw piles of money at achieving state-of-the-art performance at a given task by running exhaustive tests using techniques like GridSearch to try hundreds or thousands of permutations of hyperparameters until they find the best ones. However, in practical and applied machine learning tasks, most projects are also restricted by budget and time and therefore training time and hardware availability are both very important factors.
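To get a feel for how quickly an exhaustive search grows, here is a minimal sketch that counts the runs in a small grid. The search space and the ten minutes per training run are made-up assumptions for illustration, not recommendations.

```python
from itertools import product

# Hypothetical search space; the values are illustrative only.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "momentum": [0.5, 0.9, 0.99],
    "hidden_units": [16, 32, 64, 128],
    "dropout": [0.2, 0.5],
}

# Every combination of values is one full training run.
combinations = list(product(*grid.values()))
n_runs = len(combinations)

minutes_per_run = 10  # assumed training time for a single model
total_hours = n_runs * minutes_per_run / 60
```

Even this tiny grid needs 72 training runs, and adding one more hyperparameter with three values would triple that.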
So, the question is: what happens if you want to increase that accuracy whilst keeping the cooking and experimentation time minimal? Instead of randomly tweaking the recipe hundreds of times, is there a smart way we can work out which ingredient you should change to improve this recipe?
This article describes my journey to find the best hyperparameter optimisation recipe for neural network models.
First of all, let’s explain what hyperparameters are and why they are so important in any machine learning model. Hyperparameters are the variables which determine how the learning process will take place. For a neural network, these variables might include: number of hidden layers, number of neurons, activation function, weight initialisation, learning rate, momentum, regularisation, dropout. In contrast, for a random forest, hyperparameters include the number of trees in the forest, the maximum depth of each tree, the minimum number of samples required to split a node and so on. These hyperparameters are highly dependent on the machine learning model that we want to bake. For all models, these variables need to be set before training and have a huge impact on the way that your model behaves (like ingredients in a cake). That’s why choosing the right set is crucial to a well-trained model.
On the other hand, parameters (not ‘hyper’ parameters) are the coefficients of the model, such as the weights and biases of a neural network, and they are learned by the model itself during training. Training returns the parameter values that minimise the error according to a given optimisation strategy. So to reiterate: hyperparameters are not updated by the training algorithm and always need human (or semi-automatic) intervention, whereas parameters are learned by the model from the data.
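The distinction is easier to see in code. The sketch below is a toy example of my own (the function name `fit_line` and the data are hypothetical): the learning rate and the number of epochs are hyperparameters that we pass in before training, while `w` and `b` are parameters that the training loop learns.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=500):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # parameters: learned from the data
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1

# learning_rate and epochs are hyperparameters: set by us, not learned.
w, b = fit_line(xs, ys, learning_rate=0.05, epochs=500)
```

Changing `learning_rate` or `epochs` changes how (and whether) the loop reaches good values of `w` and `b`, but the loop itself never updates them.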
When I started this search, I found that, just like there’s no “silver bullet” for baking the perfect cake (methods and recipes vary depending on whether you want cupcakes or a bakewell tart) there are no simple ways to set hyper-parameters for all possible machine learning models. There are many task settings, therefore many strategies.
There are a number of blog posts that explain and define hyperparameters for specific models and what those hyperparameters do. E.g.
- Michael Nielsen, “Neural networks and deep learning”, 2015. Online resource: http://neuralnetworksanddeeplearning.com/chap1.html
- Andrew Ng’s course on Improving deep neural networks, hyperparameter tuning, regularisation and optimization. https://www.coursera.org/learn/deep-neural-network 
- Cornell Education, Advanced Machine Learning Systems, Fall 2017 by Christopher De Sa. http://www.cs.cornell.edu/courses/cs6787/2017fa/
However, they often limit themselves to hyperparameter definitions only. Practical advice on how to tune these hyperparameters is not usually provided. Model performance is highly dependent on good choices made by the data scientist, who may be missing the context that helps them to understand the relationship between model characteristics and hyperparameter values at training time.
I’m not going to go through all the neural network hyperparameter definitions, but I will discuss some steps that have worked for me after a lot of research.
The following questions/ingredients are necessary to find the perfect recipe for a neural network. In other words, these steps are crucial when looking for optimal hyperparameters and they need to be decided before baking:
1. How to choose between babysitting one model or training many models in parallel?
Babysitting is all about trial and error of one model type at a time. You typically bake several of the same model type with tweaked recipes and see which one came out best. For example, we might take a simple multi layer perceptron (MLP) and tweak the momentum and learning rate. Alternatively you may wish to conduct parallel model search in which we try many completely different recipes at a time (e.g. we train an MLP, an LSTM and a Random Forest). In both approaches you may bake (or train) many models at the same time if you have a large enough training environment. However, the difference is whether you are tweaking a single model or many models.
The decision often comes down to what you already know about the problem that the model will try to solve. Some models perform better than others for some use cases (for example, an LSTM is probably better for solving time series problems than a simple MLP). If you don’t understand the characteristics of the problem then you might want to try all of the models. If you have a good idea about the problem and which models are good for solving it, you might want to stick to babysitting. Yoghita Kinha’s “An easy guide to choose the right Machine Learning Algorithm” (listed in the references at the end) provides some good advice on how to choose the right ML algorithm according to the data you have or the problem you want to solve.
2. Should I normalise or scale the inputs?
According to LeCun et al., normalising the inputs as part of data preprocessing can make convergence happen faster — without this preprocessing step, the model has to also learn how to normalise inputs for comparison.
Normalisation scales variables to have values between 0 and 1, whereas standardisation transforms the data to have a mean of zero and a standard deviation of one. The choice depends on your data and whether the input variables are uncorrelated.
There are different scaling and normalisation approaches depending on the input data. Note that some of them, such as the MinMax or Standard scalers, may suffer from the presence of outliers. The scikit-learn documentation gives a good summary of these different transformation operations.
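Here is a minimal sketch of the two transformations using only the standard library. The function names and the `ages` data are my own illustration; in practice you would probably reach for a library scaler instead.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo


def standardise(values):
    """Shift and scale values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]  # assumes sigma > 0


ages = [20.0, 30.0, 40.0, 50.0, 1000.0]  # 1000.0 is a deliberate outlier
scaled = min_max_scale(ages)
standardised = standardise(ages)
```

Notice what the outlier does to the min-max version: the four realistic values are all squashed below 0.05, which is the kind of sensitivity to outliers mentioned above.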
3. How to choose the number of neurons?
There are several approaches to selecting the number of neurons. The number of neurons affects the network’s ability to generalise. Using too few neurons could result in underfitting. In other words, your cake will be undercooked or not cooked at all. Using too many could result in overfitting. In other words, your cake is overcooked or burned. To avoid overcooking or undercooking your model, Jeff Heaton recommends selecting a number of neurons between the size of the input layer and the size of the output layer. Two related rules of thumb are to take ⅔ the size of the input layer plus the size of the output layer, or to choose a number that is less than twice the size of the input layer.
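These rules of thumb are simple enough to write down as a helper. This is only a sketch: the function name and the dictionary keys are hypothetical, and the results are starting points for experimentation rather than answers.

```python
def hidden_neuron_candidates(n_inputs, n_outputs):
    """Return starting points for the hidden layer size from the rules above."""
    low, high = sorted((n_inputs, n_outputs))
    return {
        # Somewhere between the output layer size and the input layer size.
        "between_input_and_output": (low, high),
        # Two thirds of the input layer size plus the output layer size.
        "two_thirds_input_plus_output": round(2 * n_inputs / 3) + n_outputs,
        # Stay strictly below twice the input layer size.
        "upper_bound_exclusive": 2 * n_inputs,
    }


candidates = hidden_neuron_candidates(n_inputs=30, n_outputs=3)
```

For 30 inputs and 3 outputs this suggests trying around 23 neurons, and never more than 59.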
4. How to choose an activation function?
The activation function is the non-linear transformation that we apply to the input signal of a neuron, and the right choice depends on the task, such as regression, binary classification or multiclass classification. The most common are sigmoid (binary classification outputs), softmax (multiclass classification outputs) and ReLU (hidden layers). All of these functions have advantages and disadvantages. A well-known disadvantage of sigmoid is the vanishing gradient problem: its derivative is never larger than 0.25 and approaches zero for large positive or negative inputs, so the gradient shrinks as it is propagated back through many layers. ReLU is often used in hidden layers for that reason: its gradient is 1 for positive inputs, which helps avoid vanishing gradients, and it is cheap to compute, so training tends to take less time.
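A quick way to see the vanishing gradient effect is to multiply activation derivatives across layers. The sketch below is a simplification that ignores the weights and looks only at the activation derivatives; the ten-layer depth is an arbitrary assumption.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


# The sigmoid derivative peaks at 0.25 (at x = 0), so a chain of ten
# sigmoid layers scales the gradient by at most 0.25 ** 10.
best_case_sigmoid = sigmoid_grad(0.0) ** 10
best_case_relu = relu_grad(1.0) ** 10  # stays at 1.0 for positive inputs
```

Even in the best case the sigmoid chain shrinks the gradient to less than a millionth of its size, while the ReLU chain passes it through unchanged for positive inputs.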
5. How to initialise the weights?
Initialising the weights is a bit like choosing how to distribute your cake batter to go into the oven and also has a profound effect on how well the model bakes. Spreading your batter across a long and thin tray will produce a light and fluffy tray bake. Putting all your batter in a tall and narrow circular container will produce a rich and moist celebration cake.
A correct weight initialisation can help prevent the gradients flowing through the network from vanishing or exploding. Initial weights that are too large lead to exploding gradients. On the other hand, initial weights that are too small lead to vanishing gradients.
The most popular weight initialisations are:
- Random initialisation, which is better than just assigning 0 to every weight. However, choosing a reasonable range for the initial weight values can be tricky with this one.
- Xavier initialisation with a uniform or normal distribution (better for Sigmoid or Tanh activation functions).
- He initialisation with a normal distribution (better for ReLU).
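For the curious, here is a minimal sketch of the Xavier (uniform) and He (normal) schemes, using the commonly quoted formulas and plain Python lists. The function names and layer sizes are my own; deep learning libraries ship their own initialisers, which you would normally use instead.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def he_normal(fan_in, fan_out, rng):
    """He normal: zero-mean Gaussian with standard deviation sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
w_xavier = xavier_uniform(256, 128, rng)  # e.g. for a tanh layer
w_he = he_normal(256, 128, rng)           # e.g. for a ReLU layer
```

In both cases the spread of the weights shrinks as the layer gets wider, which is what keeps the signal from blowing up or dying out as it passes through the network.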
6. How to pick the right learning rate?
Smaller learning rates require more training epochs, given the smaller changes made to the weights at each update (the network adjusts slowly and carefully). This is like cooking the cake on a low temperature over a long duration. On the other hand, larger learning rates result in rapid changes and require fewer training epochs (the network adjusts quickly but might overshoot). This is like putting the cake in a very hot oven and hoping it doesn’t burn. It is recommended to start with a learning rate of 0.01 and then increase this once you’re confident that you’re not going to overshoot (just like how you might start off with a low temperature and turn it up towards the end to get a brown/caramel finish).
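The three regimes (too slow, about right, overshooting) show up even on the simplest possible problem. The sketch below runs gradient descent on f(x) = x², a toy function I picked for illustration; the specific learning rates are assumptions chosen to make each behaviour visible.

```python
def minimise_quadratic(learning_rate, steps=50, start=5.0):
    """Run gradient descent on f(x) = x**2, whose gradient is 2 * x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x


slow = minimise_quadratic(0.001)     # barely moves in 50 steps
steady = minimise_quadratic(0.1)     # converges towards the minimum at 0
diverged = minimise_quadratic(1.1)   # overshoots further on every step
```

With the tiny learning rate the cake is still raw after 50 steps, and with the huge one it has long since caught fire.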
7. How to pick the right momentum?
Momentum helps accelerate gradients in the right direction and it’s usually chosen after choosing the right learning rate. A large momentum value means that convergence will happen faster. When the momentum term is large, the learning rate should be small, and the other way around. The starting momentum is usually set at 0.9. This is like adding baking soda to our cake batter: it helps the cake rise, and in this case it helps us escape local minima. If you don’t add enough soda, the cake will be small and soggy; if you add too much, it will become dry and tough.
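Here is a minimal sketch of the classical momentum update, again on the toy function f(x) = x². The function names are hypothetical, and the learning rate of 0.01 with momentum 0.9 simply mirrors the starting values mentioned in this article.

```python
def sgd_momentum(grad_fn, start, learning_rate=0.01, momentum=0.9, steps=100):
    """Classical momentum: the velocity is a decaying sum of past gradients."""
    x, velocity = start, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * grad_fn(x)
        x += velocity
    return x


def grad(x):
    """Gradient of f(x) = x**2."""
    return 2 * x


without_momentum = sgd_momentum(grad, start=5.0, momentum=0.0)
with_momentum = sgd_momentum(grad, start=5.0, momentum=0.9)
```

With the same small learning rate and the same number of steps, the run with momentum ends up much closer to the minimum at 0 than the run without it.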
8. Early stopping… but when?
The number of epochs defines how many times the learning algorithm works through the whole training data set. Early stopping is a type of regularisation technique, because it helps avoid overfitting. The right way to decide when to stop is to monitor performance on the validation data set. When the validation metric (recall, for example) starts decreasing, or the validation loss starts rising, it’s a sign that training needs to stop. This is like trying to work out when the cake is cooked: if you leave it in the oven too long it’s going to overcook or burn, and if you take it out too soon it’s not going to be cooked.
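In practice a single bad epoch is often just noise, so a common refinement is to wait a few epochs before giving up. The sketch below monitors validation loss (rather than recall) with a `patience` setting; the function name, the patience of 3 and the loss values are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch, stopping once the validation
    loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; keep the weights from best_epoch
    return best_epoch


# Made-up validation losses: they improve, then rise as the model overfits.
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
stop_at = early_stopping_epoch(history)
```

Note the trade-off: with a patience of 3 the loop never sees the last value in `history`, so a larger patience costs more oven time but is less likely to take the cake out too early.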
9. How to choose a dropout probability?
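Dropout randomly switches off some of the nodes at each training step, so the network cannot rely too heavily on any single node, which helps reduce overfitting. The good thing about dropout is that it is only applied during the training phase; at prediction time all nodes are used. The dropout probability is usually set at 0.5.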
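Below is a minimal sketch of the "inverted" variant of dropout, which is one common way to implement it: surviving activations are scaled up during training so that nothing needs to change at prediction time. The function name and the layer of ones are my own illustration.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each activation with probability p_drop during
    training and rescale the survivors so the expected value is unchanged."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
layer_output = [1.0] * 10_000
train_out = dropout(layer_output, p_drop=0.5, rng=rng)
test_out = dropout(layer_output, p_drop=0.5, rng=rng, training=False)
```

During training roughly half of the activations become zero and the rest are doubled, while at prediction time the layer output passes through untouched.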
After you decide all the answers to these questions it’s nearly baking time but there’s one more question left…
The final decision to make is which optimization algorithm to use. This algorithm will calculate the parameters and weights that minimise the loss function (so not hyperparameters), and it is also chosen before training. This is a bit like choosing a type of oven. Do you cook on the stove top in a frying pan? Do you cook in the oven? Do you cook in the microwave or even on the barbecue? All of them will cook your model, but at different speeds and with different approaches, and they produce very different outcomes.
The optimizer you want to use will depend on your problem. The most popular optimizers are SGD, RMSprop, Adam, Adadelta, Adagrad, Adamax and Nadam; they differ mainly in how they adapt the learning rate and in how they converge. Alongside the optimizer you also choose the loss function that it minimises. The most used loss functions are MSE, categorical cross-entropy and binary cross-entropy: MSE is for regression problems, while categorical and binary cross-entropy are for classification problems.
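The three loss functions are short enough to write out by hand. This is a sketch using their standard textbook definitions with plain Python lists; the function names and the `eps` clipping value are my own choices, and real projects would use the versions built into their framework.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error for regression targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y_true)


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy for one-hot labels and class probabilities."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        total += sum(t * math.log(max(p, eps))
                     for t, p in zip(true_row, pred_row))
    return -total / len(y_true)
```

For example, `mse([1.0, 2.0], [1.0, 4.0])` averages the squared errors 0 and 4, and a classifier that predicts a probability of 0.5 for the correct class gets a cross-entropy of ln 2 from either of the other two functions.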
Finally, your cake is ready! That leads us to the final question:
10. How do we choose what ingredient (hyperparameter) to tweak next?
After placing all the ingredients inside the model, if you think you could improve the results, you can always update the hyperparameters. I would recommend starting with the number of epochs. This is like giving the cake a few more minutes in the oven, or adding some icing or cherries on top.
At Filament, we use these tricks and techniques to train models to classify both numerical and textual data to help our clients make best use of their unstructured data. Our hyperparameter optimisation system, named Cielo, accelerates our model development process and has improved a number of our existing models by as much as 10%.
I hope this article helped you understand how to pick the right amount of ingredients for your model. As we know, neural networks are hard to train and many factors can play a role in that, so choosing good hyperparameters beforehand is important. Understanding how hyperparameters work is also an area of ongoing research. The good news is that with this recipe you can have a greater understanding of what to add and how to adjust hyperparameters. Try these steps and let me know how it goes!
Thanks for reading!
- Leslie N. Smith (2018). “A disciplined approach to neural network hyperparameters: Part 1 — Learning rate, batch size, momentum and weight decay”. Cornell University.
- G. Goos, J. Hartmanis and J. van Leeuwen (1998). “Lecture Notes in Computer Science: Neural Networks: Tricks of the Trade”. Springer.
- Yann LeCun, Leon Bottou, et al. (1998). “Efficient BackProp”. Springer.
- Jeff Heaton (2017). “The Number of Hidden Layers”. Heaton Research. Online resource: https://www.heatonresearch.com/2017/06/01/hidden-layers.html
- Michael Nielsen (2015). “Neural networks and deep learning”. Online resource: http://neuralnetworksanddeeplearning.com/chap1.html
- Andrew Ng’s course on Improving deep neural networks, hyperparameter tuning, regularisation and optimization. https://www.coursera.org/learn/deep-neural-network
- Cornell Education. Advanced Machine Learning Systems, Fall 2017 by Christopher De Sa. http://www.cs.cornell.edu/courses/cs6787/2017fa/
- Yoghita Kinha, “An easy guide to choose the right Machine Learning Algorithm”. https://www.kdnuggets.com/2020/05/guide-choose-right-machine-learning-algorithm.html