Hands-on Machine Learning in Python — Decision Tree Classification

Original article was published by Christopher Tao on Artificial Intelligence on Medium


A completed workflow of implementing a Decision Tree Model

As one of the most popular classic machine learning algorithms, the Decision Tree is much more intuitive than most others because of its explainability. Today, in this article, I am going to show you the entire workflow of implementing a Decision Tree classification model.

A typical machine learning task usually starts with data wrangling, since the data we initially get, the so-called raw data, often cannot be used directly. The typical workflow is usually as follows.

  1. Problem Statement
  2. Exploratory Data Analysis (EDA)
  3. Data Cleansing
  4. Feature Engineering
  5. Model Training
  6. Model Evaluation

Please note that the workflow might not be linear. For example, sometimes the model does not perform well after we have done the model evaluation, and we still have other ideas to try. Then, we might go back to one of steps 3–5 to apply the ideas that might improve the model.

In this article, I’ll present these steps as if they were linear. I assume that you already know what a decision tree is and roughly how it works. If not, I have written several articles that introduce the decision tree model with intuition. Please check them out at the end of this article if you need to.

1. Problem Statement

The Titanic survival dataset is famous in the data science domain and is often considered “lesson 1” on Kaggle. Of course, many excellent solutions can predict survival very well, and it turns out that the Decision Tree is not the best one. However, that doesn’t prevent us from using this dataset as an example to train our Decision Tree Classification model.

The dataset can be found on Kaggle and downloaded for free.

Please note that we are going to use train.csv only. The test.csv file is used to submit your predicted results, since Kaggle is a competition platform. The gender_submission.csv file is not relevant to the machine learning model, so please ignore it.

A description of the dataset, including the variables, the train/test sets, etc., can be found at the above link together with the downloadable dataset.

However, in practice, the information that comes with a given raw dataset might not be this clear. Knowing that the problem is “predicting survival” is far from enough. It may be necessary to conduct workshops with the data owners and other stakeholders.

Trust me, clarifying a dataset is not an easy job. Most of the time, we have to start working on a dataset without understanding it 100%, because that can sometimes be impossible.

2. Exploratory Data Analysis (EDA)

The results of EDA are not directly used in model training, but EDA cannot be skipped. We need to understand the dataset in detail in order to facilitate the data cleansing and feature selection.

Here I will show some basic and commonly used EDA.

Data Preview

Before anything else, we usually want to know what the dataset looks like. However, it is not uncommon that the dataset is too large to print entirely. The simple function head() previews the first 5 rows.
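
Note that throughout this article df_train refers to the training data loaded with pandas. A minimal loading sketch, assuming train.csv has been downloaded into the working directory:

# Assumption: train.csv from the Kaggle Titanic competition is in the working directory
import pandas as pd

df_train = pd.read_csv('train.csv')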

df_train.head()

Data Profiling

Pandas has a very convenient function for profiling a data frame. Simply call the info() function of the data frame as follows.

df_train.info()

Feature Statistics

It is also very important to check the statistics of the features. A Pandas data frame can do that for us very easily.

df_train.describe()

The describe() function will automatically select the numeric features and calculate their statistics for us. However, what if we also want a picture of the categorical or string-type features? We can use the include parameter of the describe() function and pass in a list of dtypes explicitly as follows.

df_train.describe(include=['O'])

Please note that the type 'O' means the object dtype, which is how Pandas stores string columns. Pandas will not read any column as the categorical dtype by default unless you explicitly convert it. If you do have any columns of the categorical dtype (which the data profile would show), you also need to pass 'category' in the list.

EDA Outcomes

In practice, you may need to do more for EDA, such as plotting the features as histograms to see their distributions, computing the correlation matrix and so on. This article will stop here, since it is only meant to show the type of things to do in EDA.
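
For example, a minimal sketch of those two extra checks, assuming the df_train frame loaded earlier:

import matplotlib.pyplot as plt

# Histograms of the numeric columns to inspect their distributions
df_train.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

# Correlation matrix of the numeric columns only
print(df_train.select_dtypes(include='number').corr())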

After the above EDA tasks, we have found some problems that need to be addressed in the Data Cleansing stage.

  • There are missing values in the Age, Cabin and Embarked columns.
  • Among those columns with missing data, Age is numeric, whereas Cabin and Embarked are categorical.
  • Age and Embarked have very few missing values, while Cabin has most of its values missing (a quick check below confirms these counts).
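
These observations can be confirmed by counting the missing values per column; the small check below is my own addition:

# Count and percentage of missing values per column
missing_count = df_train.isna().sum()
missing_pct = (missing_count / len(df_train) * 100).round(1)
print(pd.concat([missing_count, missing_pct], axis=1, keys=['missing', 'percent']))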

3. Data Cleansing

Let’s look at the Age column first. We can double-check that there are indeed entries with missing ages.

df_train[df_train['Age'].isna()].head()

There are many ways to fix a data gap, such as

  • Remove the entire row if any value is missing
  • Leave it as-is (NULL values cause problems for some types of machine learning algorithms, so this option is not always applicable)
  • Fill the gaps with mean value (only for numeric variables)
  • Fill the gaps with mode value (works for both numeric and categorical variables)
  • Customised gap-filling algorithms (can be very complex, such as using another machine learning model to predict the missing values)

Here, let’s take the relatively simple methods. That is, use the mean to fill the numeric column and the mode to fill the categorical one.

To fill the Age missing data with mean, we can do as follows.

df_train['Age'] = df_train['Age'].fillna(df_train['Age'].mean())

For the Embarked column, let’s fill it with the mode value. We don’t need to compute the mode again, because describe(include=['O']) has already told us that 'S' has the largest frequency, 644 out of 889 entries.

So, let’s use 'S' to fill the NULL values in the Embarked column.

df_train['Embarked'] = df_train['Embarked'].fillna('S')
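
Equivalently, rather than hard-coding 'S', the mode can be looked up programmatically; a small sketch:

# Fill Embarked with its most frequent value instead of hard-coding 'S'
df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode()[0])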

Finally, the Cabin column has more than 70% of its values missing, so it definitely cannot be filled meaningfully. For such a column, I would rather ignore it for now. In other words, we won’t use it for our model training.

4. Feature Engineering

Feature Engineering is a very important step when training a machine learning model, especially for classic machine learning algorithms (as opposed to deep learning). Sometimes it takes up the majority of the time in the whole workflow, because we may need to revisit this stage many times to improve the performance of our model.

In our dataset, we first need to identify the features that are not usable or not useful.

  • The PassengerId feature needs to be rejected because it is just an arbitrary identifier and carries no useful information.
  • The Name feature should be rejected as well because it has no impact on whether a passenger survives.
  • The Cabin feature can be rejected because more than 70% of its values are missing.
  • The Ticket feature should also be rejected because it did not show any pattern in EDA; the tickets are just another kind of “ID”.

The rest of the features, Pclass, Sex, Age, SibSp, Parch, Fare and Embarked, all seem useful, so we will select them. Since we are going to use a Decision Tree, any features that do not help the model are simply less likely to be chosen to split the tree nodes. Let the algorithm tell us.

So, let’s build our feature data frame and label series.

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
label = 'Survived'

df_train_features = df_train[features]
s_train_label = df_train[label]

5. Model Training

Now we can train our model using these features. However, since we are going to use the Scikit-Learn library, and its Decision Tree implementation does not accept categorical values as strings, we have to apply One-Hot Encoding to our feature dataset. Specifically, our Sex and Embarked features are of string type and need to be converted into numbers.

Encoding is one of the most important techniques in data preprocessing and is worth a separate article to introduce. So, we will skip the details of One-Hot Encoding in this article. You will be able to find many excellent tutorials online if you don’t understand what it is.

First of all, let’s import the One-Hot Encoder from the Scikit-Learn library.

from sklearn.preprocessing import OneHotEncoder

Initialising the encoder is very easy. After that, let’s fit it to the two categorical features, Sex and Embarked, and transform them in one step.

encoder = OneHotEncoder()
encoded_arr = encoder.fit_transform(df_train_features[['Sex', 'Embarked']]).toarray()
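
To actually see those dimensions, you can print the shape of the encoded array; this quick check is my own addition:

# One row per passenger, one column per category value
print(encoded_arr.shape)  # expected: (891, 5)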

It can be seen that we got an 891 x 5 matrix, which means that the encoded feature set has 5 columns. This is because the Sex feature has 2 distinct values: female and male, and the Embarked feature has 3 distinct values: S, C and Q. So, the total number is five.

We can get the order of the encoded categories from the categories_ attribute of the encoder.

encoder.categories_

Once we have the order of the new features, we can generate a data frame from the encoded matrix using the new feature labels.

df_encoded = pd.DataFrame(encoded_arr, columns=[
    'Sex=female', 'Sex=male', 'Embarked=C', 'Embarked=Q', 'Embarked=S'
]).astype(int)

The new encoded feature set makes sense. For example, the first row in the above data frame shows that the passenger is male and embarked at S.
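
As a side note, a recent Scikit-Learn release (1.0 or later, which is an assumption here) can generate these column names for you instead of typing them by hand:

# get_feature_names_out() is available from scikit-learn 1.0 onwards;
# it produces names like 'Sex_female' rather than 'Sex=female'
df_encoded = pd.DataFrame(
    encoded_arr,
    columns=encoder.get_feature_names_out(['Sex', 'Embarked'])
).astype(int)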

Now, we can concatenate the encoded feature set with the original feature data frame. Don’t forget to remove the Sex and Embarked columns from the original one because they should be replaced by the encoded new features.

df_train_features = df_train_features.drop(columns=['Sex', 'Embarked'])
df_train_features = pd.concat([df_train_features, df_encoded], axis=1)

Now, we can train our Decision Tree model.

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(df_train_features, s_train_label)

Finished!

But wait, what does our model look like? For most machine learning algorithms, it is not easy to “see” what the trained model looks like. However, the Decision Tree is not one of them. We can visualise our model to see how the nodes are split.

import matplotlib.pyplot as plt
from sklearn import tree

# Use a very large, high-resolution figure because the tree has many nodes
fig, _ = plt.subplots(nrows=1, ncols=1, figsize=(100, 50), dpi=300)
tree.plot_tree(
    model,
    feature_names=list(df_train_features.columns),
    filled=True
)
fig.savefig('tree.png')

The code above plots the tree using Matplotlib and saves the graph as an image file. It is recommended to inspect the tree in the image file, because it is too big to display nicely inline if you’re using a Jupyter Notebook.
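
If the image is still unwieldy, Scikit-Learn also offers a plain-text dump of the tree that can be truncated to a few levels; a small sketch (the depth limit of 3 is my own choice):

from sklearn.tree import export_text

# Print only the top levels of the tree as indented text
print(export_text(model, feature_names=list(df_train_features.columns), max_depth=3))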

Here is a part of the tree from the “tree.png” image.

6. Model Evaluation

In this final stage, which might not actually be the end of our workflow, we need to evaluate our model.

The best way to evaluate our model would be to download test.csv from the Kaggle page above, predict survival for the test dataset and then submit the results. However, that works only because Kaggle is a competition platform that provides such a feature. In practice, we usually need to split our raw dataset into training and testing datasets ourselves.
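
For reference, a minimal sketch of such a hold-out split with Scikit-Learn; the 80/20 ratio and the random seed are arbitrary choices of mine:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    df_train_features, s_train_label, test_size=0.2, random_state=42)

holdout_model = DecisionTreeClassifier()
holdout_model.fit(X_train, y_train)
print(accuracy_score(y_test, holdout_model.predict(X_test)))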

However, in this article, I want to use another method to evaluate our model: cross-validation.

The basic idea of cross-validation is to evaluate the model training method and hyperparameters rather than a single trained model. It can be described in the following steps.

  1. Split the dataset into n segments of equal size.
  2. Train the model using n-1 segments, with the remaining segment used as the test set.
  3. Calculate the model’s prediction accuracy.
  4. Repeat steps 2–3 with a different segment as the test set, until all of them have been evaluated.
  5. Take the mean of the n accuracy numbers, which is considered the score of the model.

Let’s implement it.

import numpy as np
from sklearn.model_selection import cross_val_score
accuracy_list = cross_val_score(model, df_train_features, s_train_label, cv=10)

print(f'The average accuracy is {(np.mean(accuracy_list)*100).round(2)}%')

Please note that the parameter cv is the number of segments n mentioned above. It is quite common to use 10 for it.

It turns out that the result might not be ideal. In practice, we may revisit one of the previous stages to see whether we can reshape our dataset to get a better result, such as changing the gap-filling mechanism in the Data Cleansing stage.
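
Another simple idea on the modelling side, offered here as my own illustration rather than part of the original workflow, is to limit the depth of the tree to reduce overfitting and re-run the cross-validation:

# A shallower tree often generalises better than a fully grown one; max_depth=5 is an arbitrary choice
pruned_model = DecisionTreeClassifier(max_depth=5)
pruned_scores = cross_val_score(pruned_model, df_train_features, s_train_label, cv=10)
print(f'The average accuracy is {(np.mean(pruned_scores)*100).round(2)}%')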

Summary

Well, I would say that the average accuracy we got from the cross-validation is indeed not ideal. You may go to the Kaggle problem webpage to find other solutions that people have built, some of which achieve very high accuracy.

In this article, I have demonstrated an entire typical workflow of a machine learning job. It starts from the problem statement and how to treat the data, rather than jumping directly into training a machine learning model as most other articles do. Although there is much more detail that could be expanded, such as in the EDA and Feature Engineering steps, I hope this article has shown the typical steps that data scientists follow and given you a general picture.

Hope you enjoyed the reading!

Other Related Works

Here are some of my previous articles that introduce Decision Tree Algorithms. Please go check them out if you are interested!

Decision Tree by ID3 Algorithm (Entropy)

Decision Tree by C4.5 Algorithm (Information Gain Ratio)

Decision Tree by CART Algorithm (Gini Index)

Decision Tree for Regression Problems