Original article was published on Artificial Intelligence on Medium
Random Forest: A Simple Overview
The purpose and function of random forest & what it looks like with scikit-learn
Random forest (RF) is one of several machine learning models frequently used in supervised learning. Today, we’ll be focusing on its use in classification tasks (although there is such a thing as a random forest regressor).
Before we dive in, we need to gain a basic understanding of decision trees.
At a high-level, decision trees classify data.
The decision tree above determines whether a person is fit or unfit based on that person’s features. For example, the first node splits people based on their age. This node is called the root node, even though, counterintuitively, it sits at the top of the tree.
As you can imagine, this decision tree could be organized in a variety of different ways. For example, “Exercises in the morning?” could be put in the position of “Age < 30?”
However, “Age < 30?” sits at the top because it is the feature that splits the data into the most distinct groups — groups whose members are most similar to each other and most different from the members of the other group. In other words, it produces the lowest Gini impurity.
Gini impurity is one way to assess how well a node separates classes, or how “impure” the node is:
The formula is Gini = 1 − Σᵢ pᵢ², where pᵢ is the probability of class i in the node. The lower the result, the better the node separates the data.
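The formula is simple enough to compute by hand. Here is a minimal sketch (the function name is illustrative, not from any library) that takes the class counts in a node and returns its Gini impurity:

```python
# Hypothetical helper: Gini impurity of a node from its class counts.
def gini_impurity(counts):
    total = sum(counts)
    # Gini = 1 - sum of squared class probabilities.
    return 1.0 - sum((c / total) ** 2 for c in counts)

# A pure node (all one class) has impurity 0.0;
# a 50/50 split of two classes has impurity 0.5, the worst case for two classes.
print(gini_impurity([10, 0]))  # 0.0
print(gini_impurity([5, 5]))   # 0.5
```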
Using this formula, we can create a decision tree based on some data samples and features.
However, a single decision tree tends to overfit: it classifies the data it was trained on very well but generalizes poorly to new samples (which pretty much defeats the purpose). This is where the power of random forest comes in, combining the simplicity of decision trees with flexibility.
Random forest uses an ensemble of decision trees, making it an ensemble method. Each tree is trained on a random sample of the data drawn with replacement, considering only a random subset of n features. As a result, most of the decision trees differ from one another.
The most frequent prediction made by the decision trees — the mode — becomes the sample’s final classification.
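The majority vote can be sketched in a few lines; the per-tree predictions below are illustrative stand-ins:

```python
# Majority vote: the most frequent prediction among the trees wins.
from collections import Counter

tree_votes = ['fit', 'unfit', 'fit', 'fit']  # illustrative per-tree predictions
final = Counter(tree_votes).most_common(1)[0][0]
print(final)  # 'fit'
```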
First, a bootstrapped dataset (the same size as the original) is created by randomly selecting samples from the original dataset with replacement. Then, n random features from the dataset are selected to train the decision tree, using Gini impurity as the splitting criterion.
This process is repeated until hundreds of decision trees are made, each using a bootstrapped dataset and random subset of features.
After the random forest is made, it can be used to classify samples it has not seen before. Whichever classification is made most often, perhaps “fit” or “unfit,” is the final classification. Combining the aggregate (most frequent classification) with bootstrapped data is called bagging, short for bootstrap aggregating.
After preprocessing your data, it’s extremely easy to make a basic random forest classifier (RFC) with scikit-learn in Python.
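Here is a minimal sketch using scikit-learn’s built-in iris dataset as a stand-in for your own preprocessed data:

```python
# A minimal random forest classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# bootstrap=True and criterion='gini' are scikit-learn's defaults,
# spelled out here for clarity.
rfc = RandomForestClassifier(n_estimators=100, criterion='gini',
                             bootstrap=True, random_state=0)
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)
```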
Note that bootstrap is set to True and the criterion is ‘gini’ (both are scikit-learn’s defaults). After fitting the RFC to the training data, it can be used to classify the test data with rfc.predict.
Then, we can see how accurately the RFC classified the test data:
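scikit-learn’s accuracy_score measures the fraction of predictions that match the true labels. The arrays below are illustrative stand-ins for y_test and the RFC’s predictions:

```python
# accuracy_score = fraction of matching labels.
from sklearn.metrics import accuracy_score

y_true = ['fit', 'unfit', 'fit', 'fit', 'unfit']
y_pred = ['fit', 'unfit', 'unfit', 'fit', 'unfit']

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.8
```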
The model’s accuracy could certainly be improved with tuning, but overall, an RFC is very simple to use, and it is favored by many for exactly that simplicity and flexibility.