Random Forest: Hyperparameters and how to fine-tune them

Original article was published by Jaime Zornoza on Artificial Intelligence on Medium


How to optimise one of the most used Machine Learning models

Random Forests are an awesome kind of Machine Learning model. They solve many of the problems of individual Decision Trees, and are always a candidate to be the most accurate of the models tried when building a certain application.

If you don’t know what Decision Trees or Random Forests are, don’t worry; I’ve got you covered with the following articles. Take a quick look and come back here.

In this quick article, we will explore some of the nitty-gritty optimisations of Random Forests, along with what each hyper-parameter is, and which ones are worth optimising.

Let’s go!

Hyper-parameter considerations, tips and tricks

The most important hyper-parameters of a Random Forest that can be tuned are:

  • The Nº of Decision Trees in the forest (in Scikit-learn this parameter is called n_estimators)
  • The criterion used to split at each node (Gini or Entropy for a classification task, or the MSE or MAE for regression)
  • The maximum depth of the individual trees. The larger an individual tree, the more chance it has of overfitting the training data. However, as Random Forests combine many individual trees, this is not such a big problem.
  • The minimum number of samples required to split an internal node of the trees. By playing with this parameter and the previous one we can regularise the individual trees if needed.
  • Maximum number of leaf nodes. In Random Forest this is not so important, but in an individual Decision Tree it can greatly help reduce over-fitting and also increase the explainability of the tree by reducing the possible number of paths to leaf nodes. Learn how to use Decision Trees to build explainable ML models here.
  • Number of random features to include at each node for splitting.
  • The size of the bootstrapped dataset to train each Decision Tree with.

Alright, now that we know where we should look to optimise and tune our Random Forest, let’s see what touching some of these parameters does.

Nº of Trees in the forest:

By building forests with a large number of trees (high number of estimators) we can create a more robust aggregate model with less variance, at the cost of a greater training time. Most times the secret here is to evaluate your data: how much data is available, and how many features each observation has.

Because of the randomness of Random Forest, if you have a lot of features and a small number of trees, some features with high predictive power could get left out of the forest and not be used whatsoever, or be used very little.

The same applies to the data: if you have a lot of observations, each tree is trained on only part of the dataset, and the number of trees is small, then some observations could be left out entirely.
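As a rough illustration of the point about observations, here is a minimal sketch that computes the chance that a given observation is never drawn by any tree. It assumes each tree samples with replacement and independently of the others; the function name prob_never_sampled and the sample_fraction argument are made up for this example.

```python
def prob_never_sampled(n_obs: int, n_trees: int, sample_fraction: float = 1.0) -> float:
    """Chance that one given observation is drawn by none of the trees."""
    draws_per_tree = round(sample_fraction * n_obs)
    # A single draw misses the observation with probability 1 - 1/n_obs,
    # and all draws (within and across trees) are assumed independent.
    return (1.0 - 1.0 / n_obs) ** (draws_per_tree * n_trees)


# 10,000 observations, each tree trained on a bootstrap sample of 10% of them.
for trees in (1, 5, 20, 100):
    p = prob_never_sampled(n_obs=10_000, n_trees=trees, sample_fraction=0.1)
    print(f"{trees:>3} trees: probability an observation is never used = {p:.4f}")
```

With few trees and a small sample per tree, a large share of the data is never seen; with more trees that share quickly shrinks towards zero.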

Because adding trees does not make a Random Forest overfit, in practice you can use a large number of trees to avoid these problems. The guideline is that, with all other hyper-parameters fixed, increasing the number of trees generally reduces model error at the cost of a higher training time.

Don’t be fooled by this statement though: building a forest with 10K trees is a crazy and useless approach. The main takeaway is that as you increase the number of trees you reduce model variance, and model error generally approaches an optimum value, with smaller and smaller gains from each extra tree.

Decrease in the classification error as we increase the number of trees and histogram of estimates for a RF with 100000 trees. Source.
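To get a feel for these diminishing returns, here is a toy sketch, not a Random Forest: it computes the exact accuracy of a majority vote over independent voters that are each right 60% of the time. Real trees in a forest are correlated, so the numbers are only indicative, and majority_vote_accuracy is a name invented for this example.

```python
from math import comb


def majority_vote_accuracy(n_voters: int, p_correct: float) -> float:
    """Exact accuracy of a majority vote over n independent voters (n odd)."""
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )


# Accuracy rises quickly at first and then flattens out.
for n in (1, 11, 101, 501):
    print(f"{n:>3} voters: accuracy = {majority_vote_accuracy(n, 0.6):.4f}")
```

Most of the improvement comes from the first hundred or so voters; after that, each extra one adds very little.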

Conclusion: fine tuning the number of trees is unnecessary; simply set the number of trees to a large, computationally feasible number and you’re good to go.

The Criterion on which to split at each node of the trees

Decision Trees make locally optimal decisions at each node by computing which feature and which value of that feature best splits the observations up to that point.

To do this, they use a specific metric: Gini or Entropy for classification, and MAE or MSE for regression. For regression, the general rule is to take MSE if you don’t have many outliers in your data, as it heavily penalises those observations that are far away from the mean.

For classification, things are a bit trickier. We have to calculate a measure of impurity with either Gini or Entropy, which can sometimes result in a different split. Take the following examples of a problem where we have two classes, A and B:

  • A node with only observations of class A is 100% pure according to both Gini and Entropy.
  • A node with 10 observations of class A and 10 of class B is 100% impure according to both Gini and Entropy.
  • A node with 3 observations of class A and 1 of class B is either 75% or 81% impure, depending on whether we use Gini or Entropy respectively (Gini impurity is 0.375 out of a two-class maximum of 0.5; entropy is about 0.811 out of a maximum of 1).

Depending on which of the two we use, our model can change. There is no real rule of thumb here to know which one to pick. Different decision tree algorithms use different metrics (CART uses Gini, whereas ID3 uses Entropy).
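A small sketch of why outliers matter for the regression criterion: for a single leaf, the constant prediction that minimises MSE is the mean of the targets, while the one that minimises MAE is the median. The toy target values below are made up for illustration.

```python
from statistics import mean, median


def mse(values, prediction):
    """Mean squared error of one constant prediction."""
    return sum((v - prediction) ** 2 for v in values) / len(values)


def mae(values, prediction):
    """Mean absolute error of one constant prediction."""
    return sum(abs(v - prediction) for v in values) / len(values)


clean = [10, 11, 12, 13, 14]
with_outlier = [10, 11, 12, 13, 100]  # same leaf, one extreme target value

for name, targets in (("clean", clean), ("with outlier", with_outlier)):
    print(
        f"{name:>12}: MSE-optimal prediction (mean) = {mean(targets):.1f}, "
        f"MAE-optimal prediction (median) = {median(targets):.1f}"
    )
```

A single outlier drags the mean a long way, while the median barely moves, which is why MAE is the more robust choice for data with many outliers.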
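The three cases above can be checked with a few lines of Python. This is a minimal sketch using the standard definitions of Gini impurity and entropy (base 2); the percentages are obtained by dividing by the two-class maximum of each measure (0.5 for Gini, 1 for entropy).

```python
from math import log2


def gini(counts):
    """Gini impurity of a node, given the number of observations per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node, given the number of observations per class."""
    total = sum(counts)
    # Empty classes contribute nothing, so they are skipped.
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


for counts in ([4, 0], [10, 10], [3, 1]):
    g, h = gini(counts), entropy(counts)
    print(f"{counts}: Gini = {g:.3f} ({g / 0.5:.0%} of max), "
          f"Entropy = {h:.3f} ({h / 1.0:.0%} of max)")
```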

Formulas for Gini and Entropy. Self Made image.
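For reference, the standard definitions of the two measures can be written as follows, where p_i is the fraction of observations in the node that belong to class i and C is the number of classes (this notation is assumed here, not taken from the original image):

```latex
\[
\mathrm{Gini} = 1 - \sum_{i=1}^{C} p_i^{2},
\qquad
\mathrm{Entropy} = -\sum_{i=1}^{C} p_i \log_2 p_i
\]
```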

Having said this, Gini is usually less computationally expensive to calculate as it does not compute a logarithm. The Scikit-Learn implementation of RandomForestClassifier allows us to choose between the two, so it might be worth trying both measures and seeing which leads to a smaller error.

Conclusion: fine tuning the split criterion could lead to different forests, and as there are only two possible values, we recommend trying them both for classification forests.

The Maximum Depth of the Individual Trees

Increasing the Depth of individual trees increases the possible number of feature/value combinations that are taken into account. The deeper the tree, the more splits it has and the more information about the data it takes into account.

In an individual tree this causes overfitting. In a Random Forest, because of the way the ensemble is built, it is harder to overfit, although it is still possible for large depth values.

This parameter should be set to a reasonable value depending on the number of features in your data: don’t build stumps (really shallow trees) or insanely big trees. Tune it a little if you want, but changes around a reasonable value do not greatly impact the performance of your forest, so you don’t have to include it in a procedure like Grid Search if you don’t want to.

Conclusion: fine tuning the tree depth is unnecessary; pick a reasonable value and carry on with other hyperparameters.

The Number of random features to consider at each split

This is one of the most important hyperparameters to tune in your Random Forest ensemble, so pay close attention.

The best value of this hyperparameter is hard to pick without experimentation, so the best way to obtain it is using a Grid Search with Cross Validation, taking into account the following:

  • A small value (fewer features considered when splitting at each node) will reduce the variance of the ensemble, because the trees become less correlated with each other, at the cost of higher individual tree (and probably aggregate) bias.
  • This value should be set according to how many informative or quality features you have, taking into account noisy features that have many outliers. If your data set has very clean, polished, quality features, then the number of random features considered at each split can be relatively small: every feature considered will be a good one. If you have a lot of noisy data, then this value should probably be higher, to increase the chances of a quality feature being included in the contest.
  • Increasing the maximum number of random features considered in a split tends to decrease the bias of the model, as there is a better chance that good features will be included. However, this can come at the cost of increased variance. Also, training gets slower when we include more features to test at each node.

The most practical approach here is to cross-validate your possible options and keep the model that yields the best results, taking into account the previous considerations. You can try setting the following values in the grid search space for the RandomForestClassifier of Scikit-learn.

  1. None : This will consider all the features of your data, taking some of the randomness out of random forests, and possibly increasing variance.
  2. sqrt : This option will take the square root of the total number of features at each individual split. If we have 25 features in our data, then it will pick 5 random features at each node. This option is generally good for classification problems.
  3. 0.2 (decimal value between 0 and 1): This option allows the random forest to take a fraction of the features at each individual split. In this example we would be picking 20% of the features, which is a reasonable amount to consider if we have many features. Try 0.3, 0.4, and 0.5, and maybe even higher values if you have very noisy data. For regression problems 0.33 (a third of the features) is a good starting point to search around.
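To make the three options concrete, here is a small sketch of how many features each one would consider at a split. The helper features_per_split is hypothetical (it is not part of Scikit-learn); it follows the behaviour described in the Scikit-learn documentation, where a fractional value is turned into max(1, int(fraction * n_features)).

```python
from math import sqrt


def features_per_split(max_features, n_features):
    """Number of candidate features per split for a given max_features setting."""
    if max_features is None:
        return n_features  # every feature is a candidate
    if max_features == "sqrt":
        return max(1, int(sqrt(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))  # fraction of the features
    raise ValueError(f"unsupported max_features value: {max_features!r}")


for option in (None, "sqrt", 0.2, 0.5):
    print(f"max_features={option!r}: {features_per_split(option, 25)} of 25 features")
```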

Conclusion: fine tuning the number of features to consider when splitting at each node is fundamental, so it should be included when using a search approach to find the best hyperparameters for our forest.
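A minimal sketch of such a search with Scikit-learn's GridSearchCV, run on a synthetic dataset. The dataset, the number of trees, the number of folds, and the candidate values are all assumptions chosen to keep the example small and quick; on real data you would plug in your own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data: 25 features, of which only 5 are informative.
X, y = make_classification(
    n_samples=300, n_features=25, n_informative=5, random_state=0
)

# Search over the number of features per split and the split criterion.
param_grid = {
    "max_features": [None, "sqrt", 0.2, 0.5],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,  # 3-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```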

The size of the Bootstrapped Dataset

Lastly, we will discuss the importance of the size of the bootstrapped dataset. This is the percentage of the training data used to train each individual tree.

Because the observations are sampled with replacement, even if the bootstrapped dataset has the same size as the whole training set, the two datasets will differ. For this reason the parameter is often left untouched, and each tree is trained on a random set of observations of the same size as the initial training data.

In Scikit-learn this is controlled with the max_samples hyperparameter, which by default uses the size of the initial data set.

In expectation, drawing N samples with replacement from a dataset of size N will select about 63% (1 - 1/e, roughly 2/3) unique samples from the original set, leaving about 37% (roughly 1/3) behind. These left-out observations are called the out of bag or OOB data, and can then be used to evaluate the forest.
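This fraction is easy to check with a quick simulation. The sketch below draws one bootstrap sample using only the standard library and compares the share of unique observations with the theoretical value 1 - 1/e; the dataset size and seed are arbitrary.

```python
import random
from math import e


def bootstrap_unique_fraction(n_obs: int, seed: int = 0) -> float:
    """Share of distinct observations in one bootstrap sample of size n_obs."""
    rng = random.Random(seed)
    # Draw n_obs indices with replacement and keep only the distinct ones.
    drawn = {rng.randrange(n_obs) for _ in range(n_obs)}
    return len(drawn) / n_obs


fraction = bootstrap_unique_fraction(100_000)
print(f"Unique (in-bag) fraction: {fraction:.3f}")
print(f"Out-of-bag fraction:      {1 - fraction:.3f}")
print(f"Theoretical 1 - 1/e:      {1 - 1 / e:.3f}")
```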

Because of these considerations, it doesn’t hurt to use the full size of the training data, so most times the best thing to do is to not touch this hyperparameter.

Conclusion and further Resources.

In this post we have seen what the most important hyper-parameters of Random Forest are, how to set their values, and which of them are worth fine-tuning.

Like any ML problem, this is all dependent on your data, resources, and goal, so if you have time, do a sparse grid search first around the recommended values for each hyper-parameter and then a second, more specific search close to the optimal values found in the previous step.
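One simple way to build the second, finer grid is sketched below for a fractional hyper-parameter such as the fraction of features per split. The helper refine_grid and its arguments are hypothetical, just one possible way of doing it: it places a few evenly spaced values around the best value from the first search and clips them to a valid range.

```python
def refine_grid(best, coarse_step, points=5, low=0.05, high=1.0):
    """Evenly spaced candidates centred on `best`, spanning one coarse step."""
    half = coarse_step / 2
    step = coarse_step / (points - 1)
    candidates = [best - half + i * step for i in range(points)]
    # Clip to the valid range, round away float noise, and drop duplicates.
    return sorted({round(min(max(c, low), high), 3) for c in candidates})


# First search tried 0.1, 0.2, ..., 0.5 and 0.3 came out best:
print(refine_grid(best=0.3, coarse_step=0.1))
```

The resulting list can be used directly as the candidate values for the second search.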

The best parameter values should always be cross-validated if there is time for it, and at least a couple of combinations should be tried. For further information take a look at the following resources:

That is it! As always, I hope you enjoyed the post. If you did, feel free to follow me on Twitter at @jaimezorno. Also, you can take a look at my other posts on Data Science and Machine Learning here!

Thank you very much for reading, and have a great day!