Machine Learning Classifiers Comparison with Python

Original article was published on Artificial Intelligence on Medium

For the following example, let’s evaluate the performance of five different classification models (i.e. logistic regression, support vector classifier, decision tree, random forest and Gaussian Naïve Bayes classifier) on the Framingham Heart Study data set (a study conducted to identify the common factors that contribute to cardiovascular diseases) to determine the one that leads to the most reliable outcomes.

The following Python code will be divided into five major steps. Lines of comments are included to provide a brief explanation and guide you through the coding process.

Step #1: Data Loading

Framingham Heart Study Top Rows

Step #2: Exploratory Data Analysis

Male/Female Ratio

According to the plot above, the Framingham Heart Study contains more data points corresponding to women than to men.

Outcome Count

The plot above reveals that the Framingham Heart Study is a heavily imbalanced data set. Most of the data points belong to the negative class (i.e. low risk of developing a cardiovascular disease within ten years). The data set will need to be balanced to address this issue.

Outcome Count by Gender

Interesting. Even though the data set contains fewer data points for men than for women, the plot above suggests that the risk of developing a cardiovascular disease is higher in men than in women.
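Because the two groups differ in size, a comparison like this is easier to read as a rate than as a raw count. The following is a minimal sketch with pandas, not the article's original code. The column names `male` and `TenYearCHD` are assumptions about how the data set is labeled, and the toy values are made up purely for illustration.

```python
import pandas as pd

# Toy data (made up): 6 women (male=0) and 4 men (male=1).
# Column names are assumptions about the data set's layout.
toy = pd.DataFrame({
    "male":       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "TenYearCHD": [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 outcome column is the share of positive cases,
# so grouping by gender gives the positive rate within each group.
rates = toy.groupby("male")["TenYearCHD"].mean()
print(rates)
```

In this toy sample the men are fewer in number but have the higher positive rate, which is the same kind of pattern the plot describes.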

Step #3: Data Cleaning

Step #4: Data Balancing

The plot above shows that both classes contain an equal number of data points after applying the Random Under Sampling technique to balance the data set.
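Random under-sampling keeps every data point of the minority class and draws a random subset of the same size from the majority class. The sketch below implements that idea directly with pandas rather than with a dedicated resampling library, so it may differ from the code originally used. The function name `random_undersample` and the column name `TenYearCHD` are hypothetical, and the toy data is made up.

```python
import pandas as pd

def random_undersample(df, target, random_state=0):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[target].value_counts().min()
    # Draw n_min rows at random from each class, then stack the samples.
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    balanced = pd.concat(parts)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy imbalanced data (made up): 8 negatives and 2 positives.
toy = pd.DataFrame({"age": range(40, 50), "TenYearCHD": [0] * 8 + [1] * 2})
balanced = random_undersample(toy, "TenYearCHD")
counts = balanced["TenYearCHD"].value_counts().to_dict()
print(counts)
```

Under-sampling discards majority-class rows, so it trades data volume for balance; that trade-off is worth keeping in mind when the minority class is small.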

Step #5: Models Building and Performance Evaluation

Final Outcome

Models’ Performance Measures Scores Table

Outcome Interpretation

According to the results in the table above, the support vector classifier obtained the best accuracy, recall and F1 scores, and the second-best precision score, making it the most reliable machine learning classifier for this data set. The decision tree and Gaussian Naïve Bayes models, on the other hand, had the poorest performance and are therefore not reliable classification models for this data set.
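For reference, the four measures compared above can all be derived from the counts of true/false positives and negatives. The sketch below computes them in plain Python; the function name `classification_scores` and the example labels are hypothetical and are not taken from the study.

```python
def classification_scores(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 3 true positives, 1 false negative,
# 1 false positive and 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
scores = classification_scores(y_true, y_pred)
print(scores)
```

Recall is often the measure to watch in a screening setting like this one, since a false negative means a high-risk patient goes unflagged.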

What Comes Next?

Now that the support vector classifier has been identified as the most reliable machine learning classifier, the next step would consist of tuning its parameters to determine whether its performance can be improved further.
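One common way to approach this kind of tuning is a cross-validated grid search. The sketch below assumes scikit-learn and assumes the support vector classifier is `LinearSVC`. The synthetic data, the candidate values for `C`, and the choice of F1 as the scoring metric are all placeholders for illustration, not settings from the article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the balanced data set (not the real study data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate regularization strengths to try (assumed values).
param_grid = {"C": [0.01, 0.1, 1, 10]}

# 5-fold cross-validation, selecting the value of C with the best mean F1.
search = GridSearchCV(LinearSVC(dual=False), param_grid, scoring="f1", cv=5)
search.fit(X, y)

best_C = search.best_params_["C"]
print(best_C, round(search.best_score_, 3))
```

The tuned model should still be checked on data held out from the search; otherwise the reported score will be optimistic.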

It is worth noting that, when the machine learning classifiers were instantiated in the code above, their parameters were left at their default values, with two exceptions: the max_iter parameter of the logistic regression model, which was changed so that the model could converge, and the dual parameter of the support vector classifier, which was changed because the number of samples is larger than the number of features.
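An instantiation along those lines might look like the sketch below. It assumes scikit-learn and assumes the support vector classifier is `LinearSVC`, the scikit-learn class that exposes a `dual` parameter. The value `max_iter=1000` is a placeholder, since the value actually used is not stated here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

models = {
    # max_iter raised above the default to help the solver converge
    # (1000 is an assumed value).
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # dual=False is the setting scikit-learn recommends when
    # n_samples > n_features.
    "Support Vector Classifier": LinearSVC(dual=False),
    # The remaining models keep their default parameters.
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    print(name, type(model).__name__)
```

Keeping the models in a dictionary makes it straightforward to fit and score each one in a single loop.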

Now It’s Your Turn

In the previous example, the classification models were built using the entire data set (i.e. data points corresponding to both men and women), which assumes that the factors that contribute to cardiovascular diseases carry the same weight for both genders. To practice your coding skills, try splitting the data set by gender and building classification models for each one. Compare their performance using the evaluation metrics discussed in this article, and evaluate whether it is better to have an independent classification model for each gender or a common one for both.
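As a possible starting point for this exercise, the sketch below trains and scores one model per gender. It runs on synthetic data so that it is self-contained; the column names `male` and `TenYearCHD`, the 70/30 split, and the use of logistic regression are all assumptions to replace with your own choices.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned, balanced data set.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(5)])
df["male"] = [i % 2 for i in range(len(df))]  # alternate 0/1 as a fake gender flag
df["TenYearCHD"] = y

scores = {}
for gender, subset in df.groupby("male"):
    features = subset.drop(columns=["male", "TenYearCHD"])
    target = subset["TenYearCHD"]
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.3, random_state=0, stratify=target
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # Store the F1 score of this gender-specific model on its own test set.
    scores[gender] = f1_score(y_test, model.predict(X_test))

print(scores)
```

For a fair comparison, the common model should be scored on the same per-gender test sets as the gender-specific ones.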

Concluding Thoughts

Machine learning and artificial intelligence algorithms have many useful and diverse applications for solving problems and complex tasks. Together with data science, they have become a highly popular research trend among academics and professionals, with new research lines emerging in a wide range of fields. Researchers continue to update and develop new programming libraries and packages for multiple programming languages and software to facilitate the implementation and execution of such algorithms.

Python is a great free and open-source programming language capable of performing a wide range of machine learning, artificial intelligence, data science and data analytics tasks. Some of its most popular machine learning and deep learning libraries include scikit-learn, TensorFlow, Keras, PyTorch, Pandas and NLTK. Data scientists and analysts must make the most of these tools to solve complex real-life problems and bring added value to an organization, client or research field.

— —

If you found this article useful, feel free to download my code from GitHub. You can also email me directly at rsalaza4@binghamton.edu and find me on LinkedIn. Interested in learning more about data analytics, data science and machine learning applications in the engineering field? Explore my previous articles by visiting my Medium profile. Thanks for reading.

– Robert