Original article was published on Artificial Intelligence on Medium
R vs Python Speed Benchmark on a simple Machine Learning Pipeline
There is a lot of recurring discussion about the right tool for Machine Learning. R and Python are often considered alternatives: both are well suited to Machine Learning tasks. But when a company needs to develop tools and maintain solutions in two languages, this comes at a higher cost. Therefore, we sometimes have to choose.
In this article, I present an R vs Python speed benchmark that I ran to see whether Python really delivers the speed advantage that some claim it does.
The Benchmarked Machine Learning Pipeline
It is hard to make a benchmark entirely fair: execution speed may well depend on my code or on the libraries used. I had to make a decision, and I decided on classification of the Iris dataset. It is a relatively simple Machine Learning project, which seems to make for a fair comparison.
In both R and Python, I will use libraries that I know are commonly used and that I also like to use myself.
The steps of the benchmark
I have made two notebooks, R and Python, that both execute the following steps:
- Read a CSV file with the Iris data.
- Randomly split the data into 80% training data and 20% test data.
- Fit a number of models on the training data using built-in grid-search and cross-validation methods.
- Evaluate each of the tuned models on the test data and select the best one.
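In Python, these four steps can be sketched as follows. This is a minimal sketch using pandas and scikit-learn, not the article's actual notebook; the file name, random seed, model, and parameter grid are illustrative assumptions:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Step 1: read a CSV file with the Iris data
# (written out first here so the sketch is self-contained).
load_iris(as_frame=True).frame.to_csv("iris.csv", index=False)
data = pd.read_csv("iris.csv")
X, y = data.drop(columns="target"), data["target"]

# Step 2: random 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 3: fit a model using built-in grid search with cross-validation.
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)

# Step 4: evaluate the tuned model on the held-out test data.
test_accuracy = grid.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```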
The models of the benchmark
I have chosen to use the following list of models: Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors, and Support Vector Machine. For the latter two, I added a grid search for hyperparameter tuning with 5-fold cross-validation using multiprocessing on 3 cores.
I chose these models rather than the more popular Random Forest or XGBoost because the latter have many more parameters, and the differences between their function interfaces make it harder to ensure a perfectly equal set-up for the models' executions.
The models I chose take fewer parameters, and the way to use them is almost the same in R and Python. There is therefore a smaller risk of biasing the benchmark with a poor parameter choice.
The resulting scripts and notebooks
The R Code
The following R code was used for the benchmark:
The Python Code
The following Python code was used for the benchmark:
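The original notebook was embedded in the article and is not reproduced in this extract. The condensed sketch below is my own reconstruction, consistent with the steps described earlier; the parameter grids, random seed, and model settings are illustrative assumptions, not necessarily the values used in the benchmark:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The four models; KNN and SVM get a hyperparameter grid
# (the exact grids here are assumptions for illustration).
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {}),
    "lda": (LinearDiscriminantAnalysis(), {}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}),
    "svm": (SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}),
}

results = {}
for name, (model, grid) in candidates.items():
    # 5-fold cross-validated grid search on 3 cores, as in the article.
    search = GridSearchCV(model, grid, cv=5, n_jobs=3)
    search.fit(X_train, y_train)
    # Evaluate each tuned model on the held-out test set.
    results[name] = search.score(X_test, y_test)

best = max(results, key=results.get)
print(f"best model: {best} (test accuracy {results[best]:.2f})")
```

Passing an empty grid to GridSearchCV simply fits the model once with its defaults, which keeps the evaluation loop uniform across all four candidates.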
The results: is Python faster than R?
To make a fair comparison, I converted the complete code into a function that I executed 100 times, measuring the total time taken. Both versions were run on a MacBook Pro with a 2.4GHz dual-core Intel Core i5 processor.
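The timing set-up can be sketched like this, assuming the whole pipeline has been wrapped in a single function (here called run_pipeline, a hypothetical placeholder name) and using time.perf_counter for wall-clock timing:

```python
import time

def run_pipeline():
    # Placeholder for the complete read/split/tune/evaluate code.
    pass

# Run the full pipeline 100 times and report total and per-loop time.
n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    run_pipeline()
elapsed = time.perf_counter() - start
print(f"total: {elapsed:.2f} s, per loop: {elapsed / n_runs:.4f} s")
```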
The total duration of the R script is approximately 11 minutes and 52 seconds, roughly 7.12 seconds per loop. The total duration of the Python script is approximately 2 minutes and 2 seconds, roughly 1.22 seconds per loop.
The Python code for this particular Machine Learning Pipeline is therefore 5.8 times faster than the R alternative!
Of course, this result cannot automatically be generalized to the speed of any type of project in R vs Python. There may also be faster ways to write this code in either language, but I consider both versions reasonable approaches to writing a Machine Learning notebook when focusing on functionality rather than speed.
For me personally, the difference is more striking than I expected and I will consider it for future projects. I hope the article is useful to you as well! Thanks for reading!