# Modelling Classification Trees

Original article was published on Artificial Intelligence on Medium

## How to program one of the most popular machine learning algorithms (Python)

Decision Trees (DTs) are one of the most popular algorithms in Machine Learning: they are easy to visualize, highly interpretable, super flexible, and can be applied to both classification and regression problems. DTs predict the value of a target variable by learning simple decision rules inferred from the data features.
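To make "learning simple decision rules" concrete, here is a minimal sketch of how a single rule could be inferred from one feature. It uses Gini impurity, a common splitting criterion, and tries every midpoint between consecutive values as a threshold. The toy numbers and the helper names `gini` and `best_split` are made up for illustration; they are not taken from the dataset or from any library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Find the threshold t for the rule 'value <= t' with the lowest weighted impurity."""
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are the midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score

# Hypothetical toy data: one feature and a two-class target.
gni = [800, 900, 1500, 2000, 3000, 3500]
group = ["Low", "Low", "Lower-middle", "Lower-middle", "Lower-middle", "Lower-middle"]

threshold, impurity = best_split(gni, group)
print(f"Learned rule: value <= {threshold} -> Low, otherwise Lower-middle")
```

A full tree repeats this search over every feature and then recursively on each resulting group, which is roughly what a library implementation automates.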

In my post “The Complete Guide to Decision Trees”, I describe DTs in detail: their real-life applications, the different DT types and algorithms, and their pros and cons. Now it’s time to get pragmatic. How do you build a DT? And how do you apply it to real data? DTs are nothing but algorithms (sequences of steps), which makes them a natural fit for programming languages. Let’s see how.

# The Problem

The World Bank assigns the world’s economies into four income groups:

• High
• Upper-middle
• Lower-middle
• Low

This assignment is based on Gross National Income (GNI) per capita, calculated using the Atlas method and measured in current US dollars. The categories used here are those defined as of July 1, 2018. Using data pre-processing techniques, I’ve created a dataset that also includes other country-level variables such as population, surface area, purchasing power, GDP and others. You can download the dataset from this link.
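The assignment rule itself is just a set of thresholds on GNI per capita, which can be sketched as a small function. The cut-off values below are the ones I understand the World Bank to have published for the classification effective July 1, 2018; treat them as an assumption and check them against the official table before relying on them. The function name `income_group` is hypothetical.

```python
# Assumed thresholds in current US dollars (Atlas method), effective July 1, 2018.
# Verify against the World Bank's published classification before using them.
LOW_MAX = 995
LOWER_MIDDLE_MAX = 3895
UPPER_MIDDLE_MAX = 12055

def income_group(gni_per_capita):
    """Map a GNI per capita value to one of the four income groups."""
    if gni_per_capita <= LOW_MAX:
        return "Low"
    if gni_per_capita <= LOWER_MIDDLE_MAX:
        return "Lower-middle"
    if gni_per_capita <= UPPER_MIDDLE_MAX:
        return "Upper-middle"
    return "High"

# Hypothetical GNI per capita values, one per group.
for value in (500, 2000, 8000, 40000):
    print(value, "->", income_group(value))
```

Note that this is already a tiny decision tree on a single variable. The Classification Tree built in this post has a harder job: it has to recover the income group from the other variables in the dataset.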

The goal of this Classification Tree is to predict the income group of a country based on the variables included in the dataset.

# The Steps

You can cut down the complexity of building DTs by breaking the work into simpler sub-steps: each individual sub-routine connects to the others to build up the full tree, and this modular construction gives you more robust models that are easier to maintain and improve. Now, let’s build a Classification Tree (a special type of DT) in Python.

## Load data and describe dataset

Loading a data file is the easy part. The hard (and most time-consuming) part is usually the data preparation process: setting the right data formats, dealing with missing values and outliers, eliminating duplicates, etc.
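As a rough illustration of those preparation steps, here is a sketch on a tiny made-up table. The column names and values are hypothetical and are not the columns of the dataset used in this post. Filling gaps with the median is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: a duplicated row, numbers stored as text, and missing values.
raw = pd.DataFrame({
    "country": ["A", "B", "B", "C", "D"],
    "population": ["1000", "2500", "2500", None, "400"],
    "gdp": [1.5, 2.0, 2.0, 3.0, np.nan],
})

# 1. Eliminate exact duplicate rows.
clean = raw.drop_duplicates().copy()

# 2. Set the right data format: turn text into numbers (unparseable values become NaN).
clean["population"] = pd.to_numeric(clean["population"], errors="coerce")

# 3. Deal with missing values: impute one column, drop rows missing the other.
clean["gdp"] = clean["gdp"].fillna(clean["gdp"].median())
clean = clean.dropna(subset=["population"])

print(clean)
```

Outlier handling is left out here because the right treatment depends heavily on the variable and on the model.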

Before loading the data, we’ll import the necessary libraries:

`import xlrd`

`import pandas as pd`

`import numpy as np`

`from sklearn.tree import DecisionTreeClassifier`

`from sklearn.model_selection import train_test_split`

Now we load the dataset:

`df_c = pd.read_excel("macrodata_class.xlsx")`

Take a look at the data:

`df_c.head()`